Very Large DB2 pureScale Implementation Sharing
YunCheol Ha, IBM Australia
GyouByoung Kim, IBM Korea
Session Code: 4063
Platform: DB2 for Linux, UNIX, Windows
Agenda
• Why DB2 pureScale
• Business requirements
• Technical challenges
• Solutions
• Architecture
• Workloads
• Database configuration
• Migration and consolidations
• Lessons Learned
Why DB2 pureScale?
• Continuous high availability
  • 24 x 7 x 365
• Linear scalability
  • Data explosion and workload growth
• Simple cluster management
  • Automation
• Capacity on Demand
• Application transparency
• Database consolidation
  • Multi-tenancy
  • Mixed workload
Business requirements
• A rapid change in the customer service environment, from provider-centric services to consumer-based services
• Enhancement of the customer experience
  • An integrated single system for simple usability
  • Mobile portal support
  • 24x7x365 services
• Systematic compliance support
  • Reduction of the opportunity for irregularities and corruption
  • Reduction of the cost of compliance
• Improvement in productivity of internal staff
  • Administrative efficiency
  • Simplicity of process
  • Agility of systems along with rule amendments
Technical Challenges
• Poor customer service due to the aging and complexity of the IT infrastructure
• Data integrity issues and duplicated customer information across multiple systems
• Difficulty in managing heterogeneous systems and maintaining consistent performance due to the complexity of the technology applied
• Expectations on the newly integrated systems
  • Efficient management and operation of very large data volumes
  • Consistent performance during seasonal peak workloads
  • Flexible capacity management under workload seasonality
  • Mixed workload
  • Consolidation of 40 applications
  • Bulk data load and batch workload against very large tables during online transaction operation
  • Proven technologies and deployment experience for very large DBMSs
Solutions
• Operational DBMS: DB2 pureScale for highly available and scalable operational systems
• H/W: Robust and stable AIX and POWER7 with Capacity on Demand (CoD) and 10GE RoCE interconnect
• Storage copy: Storage snapshot copies for Very Large Database (VLDB) backups and daily batch processes
• Disaster recovery: Synchronous storage mirroring
• DW DBMS: PureData System for Operational Analytics (PDOA) for the enterprise data warehouse
• Replication: CDC from DB2 pureScale to PDOA
• Data migration: DB2 Database Partitioning Feature and federation technology for data consolidation and migration
Architecture > System Configurations
[Diagram: production servers in an internal domain and a public domain, each running a DB2 pureScale cluster (DB2 AESE on AIX with TSA/GPFS) of three 30-core members, plus Dev/QA clusters of two 2-core members and a single-node server, and other business clusters of two 35-core members and two 8-core members.]
• Total of 17 pureScale clusters
  • 2 main Very Large Database (VLDB) clusters with 3 members each
Architecture > Various pureScale Cluster Topologies
[Diagram: pureScale cluster topologies used — 3 members + 2 dedicated CFs, 3 members + 2 collocated CFs, 2 members + 2 dedicated CFs, 2 members + 2 collocated CFs, and logical members (multiple members and a CF on a single host). In every topology the members share the global buffer pool and global lock manager in the CFs and the data and log files on shared storage, and connect to the application servers through Ethernet, interconnect, and SAN switches.]
Architecture > Application configuration
• A consolidated database on pureScale for multiple applications
• DB2 and WAS workload balancing (WLB) and client affinity setup:
  • Online applications on members 0 and 1
  • Online batch applications on member 2
  • Member subsets were also considered
• Automatic client reroute (ACR) setup for Java and non-Java applications (see the db2dsdriver.cfg sketch below)
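A minimal db2dsdriver.cfg sketch for the non-Java (CLI/.NET) clients is shown below. The host names, ports, and database alias are placeholders, and the exact parameter set should be verified against the data server driver documentation for your fix pack level:

  <configuration>
    <dsncollection>
      <dsn alias="CONSDB" name="CONSDB" host="member0-host" port="50001"/>
    </dsncollection>
    <databases>
      <database name="CONSDB" host="member0-host" port="50001">
        <wlb>
          <parameter name="enableWLB" value="true"/>         <!-- spread work across members -->
          <parameter name="maxTransports" value="100"/>      <!-- cap on physical connections per application process -->
        </wlb>
        <acr>
          <parameter name="enableAcr" value="true"/>          <!-- automatic client reroute -->
          <parameter name="enableSeamlessAcr" value="true"/>  <!-- retry in-flight work without surfacing an error where possible -->
          <alternateserverlist>
            <server name="m1" hostname="member1-host" port="50001"/>
            <server name="m2" hostname="member2-host" port="50001"/>
          </alternateserverlist>
        </acr>
      </database>
    </databases>
  </configuration>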
[Diagram: application servers connect through Ethernet, interconnect, and SAN switches to the pureScale cluster; the members share the global buffer pool and global lock manager in the CFs and the data and log files on shared storage.]
Architecture > Network Topology
[Diagram: network topology — three member LPARs (30 cores / 224 GB each) and two CF LPARs (16 cores / 160 GB each), each with redundant 10G RoCE ports to two RoCE switches, GPFS private-network NICs to two private switches, 10G public NICs to two public switches, and 8G HBAs to two SAN switches attached to the shared storage.]
• Redundant RoCE adapters and switches for the interconnect network, redundant adapters and switches for the GPFS private network, and a redundant public network
• Storage Area Network (SAN) connections for the shared storage
Architecture > Multiple RoCE Switches configuration
[Diagram: two 10G RoCE switches (20-port + 16-port, 36 ports in total) connected by 8 inter-switch links; each member, CF, and additional host spreads its RoCE adapter ports across both switches.]
• High availability and performance
  • 10 RoCE adapter ports per switch
  • 8 inter-switch links (ISLs) per switch
  • Rule of thumb: number of ISLs = (total number of CF interconnects + number of members) / 2
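As a worked reading of the rule of thumb (assuming, as stated on the database manager configuration slide later in this deck, 4 RoCE ports per CF and a 3-member cluster): (2 CFs x 4 ports + 3 members) / 2 = 5.5, or about 6 ISLs; this cluster was cabled with 8 ISLs per switch, which leaves additional headroom.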
Architecture > Backups and Database Clone
• Fast storage snapshots without impact on online transactions
  • VLDB database backups
  • Point-in-time database clones
  • Around 40 TB, including 20 TB of compressed data
[Diagram: the production pureScale cluster (Mem-0, Mem-1, two CFs) serves the OLTP workload; storage snapshots of the data and log file systems feed both the database backup/restore path and a point-in-time clone used for the large batch workload.]
Architecture > Database Clone
• Daily batch processing runs on the cloned database
• Logical members and a CF on a single server
• Snapshot of only the data and transaction log file systems
• Database activation on only one member
[Diagram: the cloned database is brought up from the snapshot on a single host using logical members (with separate FCM ports) and one CF.]
Data Migration > Data Migration & Consolidations
[Diagram: migration flow from the as-was systems to the new systems — about 40 data sources (DB2 via federation, roughly 30% of the volume; Oracle via federation, roughly 60%; MSSQL audit data and Cubrid/legal data via files) are standardized and quality-checked in a DB2 DPF staging area against the to-be model, then loaded into the consolidated operational DB2 pureScale databases (internal and public systems, master data management, GIS/BPM/CRM and the new integrated portal) and replicated into the roughly 60 TB PDOA enterprise data warehouse.]
Data Migration > Data Migration & Consolidations
• Large data volume migration within a limited window of time
  • 50 to 60 TB in total from source systems on Oracle, Sybase, DB2, etc.
• Parallel bulk data processing in the staging area using the DB2 Database Partitioning Feature (DPF)
  • Data consolidation in DB2 DPF
  • Source-to-target table mappings
  • Source data extraction methods (see the sketch after this list)
    • Cursor load from Oracle and DB2 using a federation server
    • Export and load from Sybase
  • Table and SQL designs using collocated joins to maximize DPF performance
• Cold/history data migrated in advance while the source systems stayed online
• Offline data migration for the active data to minimize downtime
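A minimal sketch of the cursor-load path through federation, assuming a nickname ORA.CUSTOMER has already been created over the Oracle source table (database and object names are illustrative; the DECLARE and LOAD must run in the same CLP session):

  db2 "CONNECT TO STAGEDB"
  db2 "DECLARE src_cur CURSOR FOR SELECT * FROM ORA.CUSTOMER"
  db2 "LOAD FROM src_cur OF CURSOR MESSAGES load_cust.msg INSERT INTO STG.CUSTOMER NONRECOVERABLE"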
Workloads > Transactions
• Read/write ratio
  • 20:80 read-to-write ratio during the night
  • 85:15 read-to-write ratio during the day
• Transactions
  • 360M commits and 1,300M SQL statements daily
    • 4K commits per second
    • 15K SQL statements per second
  • 30M commits and 100M SQL statements at the peak hour
    • 8K commits per second
    • 30K SQL statements per second
Workloads > Transactions
• Mixed workload
  • Substantial rows read by complex queries for online reporting, alongside short millisecond lookup queries and write operations
  • Heavy read workload on member 2, which is used for batch processing
• Database connections
  • Average of 19K connections across the 3 members
Workloads > System utilization during peak
• Workload balance
  • Online transaction workload on member 0 and member 1
  • Online batch processing on member 2
  • Application-server-level workload balancing across the two online members
• Seasonal workload spikes
  • Capacity on Demand used to absorb the seasonal workload
[Chart: CPU utilization of member 0, member 1, and member 2 during the peak period.]
Database Configuration > Database Manager CFG
• NUMDB 1
  • One critical main database per instance
  • Simple CF-related memory configuration for a single database
    • CF_MEM_SZ: AUTOMATIC
    • CF_NUM_WORKERS: AUTOMATIC
• CF_NUM_WORKERS set based on best practices
  • At least the number of interconnect ports (4 RoCE adapter ports per member or CF)
  • One or two less than the number of CF cores
• A dedicated LPAR for each CF
• CF CPU usage monitored with db2pd -cfinfo or the ENV_CF_SYS_RESOURCES administrative view
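A minimal sketch of checking and setting these values from the CLP (run as the instance owner):

  db2 "UPDATE DBM CFG USING CF_MEM_SZ AUTOMATIC CF_NUM_WORKERS AUTOMATIC"
  db2 get dbm cfg | grep -i cf_                           # verify the CF-related settings
  db2pd -cfinfo                                           # CF processor and memory usage
  db2 "SELECT * FROM SYSIBMADM.ENV_CF_SYS_RESOURCES"      # CF host resource usage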
Database Configuration > Database CFG
• LOCKLIST size of around 1 GB
• MAXLOCKS set to 1
  • An extremely low MAXLOCKS value to discourage developers from writing bulk data processing SQL
• SHEAPTHRES_SHR to SORTHEAP ratio of 100:1
  • Many concurrent users and a substantial number of complex queries
• PCKCACHESZ
  • Because of the complex queries, parameter markers could not be used
  • An optimal package cache size was assigned to reduce compilation time for the huge number of SQL statements
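A minimal sketch of these settings; the page counts below are illustrative values chosen only to match the ratios described above, not the customer's actual numbers:

  db2 "UPDATE DB CFG FOR CONSDB USING LOCKLIST 250000"                        # ~1 GB of 4 KB pages for the lock list
  db2 "UPDATE DB CFG FOR CONSDB USING MAXLOCKS 1"                             # escalate quickly to expose bulk-processing SQL
  db2 "UPDATE DB CFG FOR CONSDB USING SHEAPTHRES_SHR 500000 SORTHEAP 5000"    # 100:1 shared sort memory to per-sort memory
  db2 "UPDATE DB CFG FOR CONSDB USING PCKCACHESZ 524288"                      # large package cache for non-parameterized SQL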
Database Configuration > Table spaces and tables
• Multiple automatic storage groups to handle a database of tens of terabytes
  • 4 storage paths per automatic storage group, for I/O performance and a balanced configuration
  • 4 containers per table space
• Over 1,000 table spaces storing around 10,000 tables
• Big tables are defined as range-partitioned tables
  • Each partition is stored in its own table space
  • Fast roll-in and roll-out
  • Designed with online operations and manageability in mind
• Compression turned on for all tables and indexes
  • 70% table compression ratio
  • 50% index compression ratio
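A minimal DDL sketch of this layout; the storage paths, names, and ranges are illustrative, not the customer's actual objects:

  db2 "CREATE STOGROUP SG_SALES ON '/db2fs1', '/db2fs2', '/db2fs3', '/db2fs4'"    -- 4 storage paths per storage group
  db2 "CREATE TABLESPACE TS_SALES_2014Q4 USING STOGROUP SG_SALES"                  -- one table space per range
  db2 "CREATE TABLE APP.SALES (SALE_DATE DATE NOT NULL, AMOUNT DECIMAL(15,2))
         PARTITION BY RANGE (SALE_DATE)
           (PARTITION P2014Q4 STARTING '2014-10-01' ENDING '2014-12-31' IN TS_SALES_2014Q4)
         COMPRESS YES"
  db2 "CREATE INDEX APP.IX_SALES_DATE ON APP.SALES (SALE_DATE) COMPRESS YES"       -- index compression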
Lessons Learned > GPFS configuration
• Most GPFS parameters are pre-optimized for pureScale
• AIX AIO parameters were tuned to avoid contention among AIO server processes
  • The number of AIO server processes on an AIX host is the number of logical cores times the value of the aio_maxservers parameter
  • For example, with 30 physical cores, SMT4 enabled, and aio_maxservers = 30, up to 3,600 AIO server processes can be started: 30 x 4 (SMT4) x 30
  • AIO server monitoring: nmon, capital "A" view
• Tuned values: aio_maxservers = 1, aio_minservers = 1 (defaults: aio_maxservers = 30, aio_minservers = 3)
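A sketch of how these AIO tunables might be checked and changed on AIX 6.1/7.1, where they are ioo tunables; they are restricted tunables, so ioo may warn or prompt before changing them, and the process name used in the count below may vary by AIX level (run as root):

  ioo -o aio_maxservers -o aio_minservers            # display the current values
  ioo -p -o aio_maxservers=1 -o aio_minservers=1     # set the tuned values and persist them across reboots
  pstat -a | grep -c aioserver                       # count the AIO server kernel processes currently started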
Lessons Learned > Syslog configuration
• Syslog setup is recommended for troubleshooting
• Syslog integrates log data from many different sources
  • RSCT, TSA, and DB2 write messages to the syslog daemon that can be helpful for pureScale diagnosis
• On AIX the syslog daemon is not configured and running by default; as root:
  • vi /etc/syslog.conf and add: *.debug /var/log/syslog.out rotate time 1d files 7
  • touch /var/log/syslog.out
  • refresh -s syslogd
• http://www-01.ibm.com/support/docview.wss?uid=swg21302886
Lessons Learned > RoCE switch configuration
• Two RoCE switches were set up for high availability, and RoCE switch failover was configured:
  • Disable the Converged Enhanced Ethernet (CEE) feature
  • Enable Global Pause (IEEE 802.3x) flow control to avoid dropped packets
      # interface port <first port>-<last port>
      # flowcontrol both
  • Disable Spanning Tree Protocol (STP)
      # spanning-tree mode disable
  • Enable Link Aggregation Control Protocol (LACP) on the inter-switch links (ISLs) of each switch to remove network loops
      # interface port <port number>
      # lacp mode active
      # lacp key <any available key, 1 is ok if there is no previous config>
      # exit
…
Lessons Learned > uDAPL_ping
• RDMA network health check
  • The uDAPL_ping script validates uDAPL connectivity, via uDAPL ping, for every listed HCA and cluster interconnect netname combination
  • Download udapl_ping.zip from IBM
• uDAPL ping validation
  • Create a host-hca file listing the available HCAs (from /etc/dat.conf) and the interconnect netnames of the pureScale hosts
  • Run: validateUdaplPing <host-hca-file> <dat-version>
• Example host-hca file:
    vi host-hca
    HostA HostA-en1-1 hca0
    HostA HostA-en1-2 hca1
    HostB HostB-en1-1 hca0
    HostB HostB-en1-2 hca1
• uDAPL_ping validation example:
    validateUdaplPing host-hca 2.0
    100 bytes from 10.10.1.1: seq=0 time=82495
    100 bytes from 10.10.1.1: seq=1 time=22
    100 bytes from 10.10.1.1: seq=2 time=22
    100 bytes from 10.10.1.1: seq=3 time=22
    100 bytes from 10.10.1.1: seq=4 time=23
    round-trip average: 16516
    uDAPL ping from HostA-en1-1 (client) to HostA-en1-1 (server) was successful
    ...
Lessons Learned > Interconnect performance
• Interconnect performance can be monitored through the CrossInvalidate message send time and the execution times of other CF commands reported by the MON_GET_CF_CMD table function
  • Around 10 microseconds or less indicates good health
• The MON_GET_CF_WAIT_TIME table function also provides the interconnect transport time plus CF command execution time
  • Around 100 µs or less indicates good health
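A minimal monitoring sketch using these table functions; the NULL and -2 arguments follow the usual "all CFs" / "all members" conventions of the MON_GET functions, and the exact column names should be verified for your fix pack level before building alerts on them:

  db2 "SELECT * FROM TABLE(MON_GET_CF_CMD(NULL)) AS T"         -- per-CF-command request counts and times
  db2 "SELECT * FROM TABLE(MON_GET_CF_WAIT_TIME(-2)) AS T"     -- per-member CF wait times (transport + execution)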
Lessons Learned > Member & CF Collocation
• A dedicated LPAR for each CF is recommended in an AIX environment
• However, when a member and a CF need to be collocated in an AIX RoCE environment, CPU binding is recommended
  • An 80/20 rule is applied when assigning logical cores between the member and the CF processes
  • CF_NUM_WORKERS = logical cores of the CF - 1
• CPU affinity setup using rsets
  • Create rsets for the member and the CF
      mkrset -c 0-20 pscale/memberrset
      mkrset -c 21-26 pscale/cfrset
  • Start the DB2 database manager
      db2start
  • Bind the DB2 member processes to the member rset
      ps -ef | grep db2
      attachrset pscale/memberrset <db2 process id>
  • Bind the CF processes to the CF rset
      ps -ef | grep ca
      attachrset pscale/cfrset <cf process id>
Lessons Learned > Database backup & clone
• Snapshot storage copy
    Step 1: db2 flush bufferpool all
            db2 set write suspend for database
    Step 2: /usr/lpp/mmfs/bin/mmfsctl <filesystem> suspend-write   (data and log file systems)
            take the snapshot copy (storage copy)
    Step 3: /usr/lpp/mmfs/bin/mmfsctl <filesystem> resume
            db2 set write resume for database
• Database clone and DB2 backup from the snapshot copy
  • Clone database:
      Step 1: Attach and mount the snapshot copy to a clone server
      Step 2: db2start
      Step 3: db2inidb <dbname> as snapshot
  • DB2 backup from the snapshot image:
      Step 1: Attach and mount the snapshot copy to a backup server
      Step 2: db2start
      Step 3: db2 backup db <dbname> to /<filesystem> or to a backup library
Lessons Learned > Online fixpak update
• Fixpak update from fixpak 3 to fixpak 4 during online operations, in the order members, secondary CF (CF-S), primary CF (CF-P)
• Pre-checks (all hosts):
  • Copy the DB2 fixpak image: tar -xvf /IBM/db2105fp4/v10.5fp4_aix64_server_t.tar
  • Verify the minimum committed code level: pure1[root]:/IBM/db2105fp4/server_t> ./installFixPack -show_level_info
  • Check free disk space: /opt (6300000 KB), /tmp (2000000 KB)
  • Verify the DB2 fixpak version: db2level
  • Verify the TSAMP version: pure1[root]:/IBM/db2105fp4/server_t/db2/aix/tsamp> ./db2cktsa -v install
                              pure1[root]:/IBM/db2105fp4/server_t/db2/aix/tsamp> ./db2cktsa -v media
  • Verify the GPFS version: pure1[root]:/IBM/db2105fp4/server_t/db2/aix/gpfs> ./db2ckgpfs -v install
                             pure1[root]:/IBM/db2105fp4/server_t/db2/aix/gpfs> ./db2ckgpfs -v media
• Steps:
  1. Install the online fixpak (member, CF-S, CF-P order), on all hosts:
     pure1[root]:/IBM/db2105fp4/server_t> ./installFixPack -p /opt/IBM/db2/V10.5.4 -I db2 -online -l /tmp/fp4install.log -t /tmp/fp4install.trc
  2. Determine the success of the online fixpak update, on all hosts:
     pure1[root]:/IBM/db2105fp4/server_t> ./installFixPack -check_commit -I db2
  3. Commit the online fixpak update:
     pure1[root]:/IBM/db2105fp4/server_t> ./installFixPack -commit_level -I db2 -l /tmp/fp4install.log -t /tmp/fp4install.trc
  4. Verify the fixpak version:
     pure1@db2:/home/db2> db2pd -rustatus
Lessons Learned > Offline fixpak update (simplified)
• Offline fixpak update to apply a special build during the system maintenance window, in the order members, CF-S, CF-P
• Pre-checks: same as for the online (rolling) fixpak update
• Steps:
  1. Stop the database manager:
     pure1@db2:/home/db2> db2stop
  2. Stop the DB2 instance, on all hosts:
     db2stop instance on pure1
  3. Install the fixpak, on all hosts:
     pure1[root]:/IBM/db2105fp4/server_t> ./installFixPack -p /opt/IBM/db2/V10.5.4 -I db2 -offline -l /tmp/fp4install.log -t /tmp/fp4install.trc -f TSAMP -f GPFS
  4. If db2instance -list shows an inconsistent state, refresh the resource model:
     pure1@db2:/home/db2> db2cluster -cm -repair -resources
  5. Determine the success of the fixpak update, on all hosts:
     pure1[root]:/IBM/db2105fp4/server_t> ./installFixPack -check_commit -I db2 -t /tmp/checkcommit.trc -l /tmp/checkcommit.log
  6. Commit the fixpak update:
     pure1[root]:/IBM/db2105fp4/server_t> ./installFixPack -commit_level -I db2 -l /tmp/commitlevel.log -t /tmp/commitlevel.trc
  7. Restart the DB2 instance, on all hosts:
     pure1@db2:/home/db2> db2start instance on pure1
  8. Restart the database manager:
     pure1@db2:/home/db2> db2start
Lessons Learned > Online system maintenance
• Online (rolling) system maintenance to apply hardware configuration changes
• Steps, run on each host in turn, in the order members, CF-S, CF-P:
  1. Quiesce the member:
     $ db2stop member 0 quiesce 30
  2. Stop the DB2 instance on the member host:
     $ db2stop instance on pure1
  3. Enter cluster manager maintenance mode:
     # /opt/IBM/db2/V10.5.4/bin> ./db2cluster -cm -enter -maintenance
     Host 'pure1' has entered maintenance mode.
  4. Enter shared file system cluster maintenance mode:
     # /opt/IBM/db2/V10.5.4/bin> ./db2cluster -cfs -enter -maintenance
     Host 'pure1' has successfully entered file system maintenance mode.
  (Perform the system maintenance on the host)
  5. Exit cluster manager maintenance mode:
     # /opt/IBM/db2/V10.5.4/bin> ./db2cluster -cm -exit -maintenance
     Host 'pure1' has exited maintenance mode. Domain 'db2domain_20141025220444' has been started.
  6. Exit shared file system cluster maintenance mode:
     # /opt/IBM/db2/V10.5.4/bin> ./db2cluster -cfs -exit -maintenance
     Host 'pure1' has successfully exited file system maintenance mode.
  7. Restart the instance:
     $ /home/db2> db2start instance on pure1
  8. Restart the member:
     $ /home/db2> db2start member 0
Lessons Learned > WebSphere Application Server
• Workload balancing (WLB) and automatic client reroute (ACR) setup
  • WLB and ACR are configured through JCC data sources connecting to the DB2 pureScale cluster, from WAS version 7.0.0.9 or later
• Data source custom properties:
  • Driver type: the JCC driver type, 4
  • Database name: the pureScale database name
  • Server name: host name of one DB2 pureScale connection member
  • Port number: TCP/IP port number of that DB2 pureScale connection member
  • enableSysplexWLB: true, to enable WLB
  • Dynamic alternate server list
    • db2.jcc.outputDirectory
  • Or a static alternate server list
    • clientRerouteAlternateServerName: the server list, separated by commas
    • clientRerouteAlternatePortNumber: the TCP/IP port list, separated by commas, for the servers specified in clientRerouteAlternateServerName
• Data source connection pool properties:
  • Purge Policy: set to FailingConnectionOnly for WAS to support seamless ACR
Lessons Learned > WebSphere Application Server
• maxTransportObjectIdleTime may need to be tuned
  • Under WLB, each logical connection from WAS may have a physical connection (a "transport"; one transport object per physical connection) to every single member
  • The number of transport objects is controlled by the maxTransportObjects property, and transport objects can be dropped when they have been idle for more than the number of seconds defined by maxTransportObjectIdleTime (default 10 seconds)
  • Dropping idle transport objects from a member can cause performance degradation, because no idle transport objects are then available on that member when new connections need them
  • maxTransportObjectIdleTime was set to 3,600 seconds to avoid frequent dropping of transport objects
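A minimal sketch of these JCC global properties; the values are illustrative, and in WAS the same settings can also be supplied per data source as custom properties (check the JCC documentation for which scope applies to your driver level):

  # DB2JccConfiguration.properties on the WAS classpath
  db2.jcc.maxTransportObjects=1000
  db2.jcc.maxTransportObjectIdleTime=3600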
Lessons Learned > Distributed Shell (dsh)
• Simplifies the task of managing multiple pureScale AIX hosts
  • dsh distributes the same command to all the pureScale hosts
• dsh setup:
  1. Install the AIX fileset csm.dsh.1.6.0.0.bff
  2. Add the node list to the environment
     vi /.profile
     export WCOLL=/node_list
  3. Check the node list
     [pure1] root:/> cat node_list
     pure1
     pure2
     pure3
     pure4
• Example:
    pure1[root]:/> dsh -c
    dsh> date
    pure1: Sun Nov 16 18:42:52 2014
    pure4: Sun Nov 16 18:42:52 2014
    pure3: Sun Nov 16 18:42:52 2014
    pure2: Sun Nov 16 18:42:52 2014
    dsh> exit
• Impressive DB2 pureScale performance, without much tuning effort, for complex queries that had been quite slow before
• Most SQL runs well when the statistics of the database objects are kept up to date and appropriate indexes are created; a few SQL statements still require tuning, which demands deep knowledge of and experience with the DB2 optimizer
• Preparing the hardware and OS software prerequisites is the most important step of a pureScale installation
• Easier monitoring and problem determination for the pureScale cluster services (RSCT/TSA/GPFS) are needed
Feedback
YunCheol Ha, IBM, [email protected]
GyouByoung Kim, IBM, [email protected]
Very Large DB2 pureScale Implementation Sharing
Please fill out your session evaluation before leaving!