High Availability in SAP HANA Multitier Cost Optimized Scenario
Case Study
Cleber Paiva de Souza <[email protected]>
Gabriel Dieterich Cavalcante <[email protected]>
S-SYS Sistemas e Soluções Tecnológicas
Who we are
• S-SYS was founded in 2014.
• SUSE Partner since its foundation.
• Staffed by professionals with experience across the entire SUSE portfolio.
• Staffed by certified professionals: CLP, CLE, CNI, SAP HANA.
Aché Farmacêutica
Aché is a 100% Brazilian pharmaceutical laboratory, founded in 1966, that has been strengthening and expanding its operations over five decades of success.
The company operates in the Prescription, Over-the-Counter (prescription-exempt), Generic, Dermatological and Dermocosmetic segments. It also offers nutraceutical and probiotic products.
Today the portfolio covers more than 23 medical specialties, with more than 316 brands in 762 presentations. By 2020, the company plans to launch 184 more products, and 120 in the next three years.
Aché also holds a stake in Bionovis, which will be inaugurated soon and focuses on research, development, production, distribution and commercialization of biotechnological medicines.
• 1st company to implement SAP APO on HANA in Latin America
• 1st pharmaceutical company to implement SAP ECC on HANA Multitenant (MDC)
Example of performance gains (I)
[Chart: creation of consolidated files from Accounts Payable, Accounts Receivable and Purchasing; ORACLE vs. HANA]
Example of performance gains (II)
[Chart: Invoice Processing job; processing time (ms) by timestamp]
SUSE for SAP
SUSE Linux Enterprise Server for SAP Applications
Latest release for x86-64 servers
SAP HANA
What's SAP HANA
• In-memory database
• Advanced analytics
• Versions:
  • Basic, Platform and Enterprise Editions
  • Additional capabilities as add-ons
• Sold as appliance, TDI (Tailored Datacenter Integration) or in the cloud
  • Appliance: by certified hardware partners
  • TDI: using customer-owned hardware
    • Certified SAP HANA appliances without storage as listed in the SAP HANA Hardware Directory
    • Certified storage systems as listed in SAP HANA Hardware Certification – Enterprise Storage
    • Certified professional (E_HANAINS151)
SAP HANA deployment types
Single Instance vs Multi-tenant (MDC)
Single Instance:
• One database per instance.
• Available in all SPS (Support Package Stack) versions.
• Indexserver port 3NN15 (NN = instance number).
• It's possible to convert to MDC.
Multi-tenant (MDC):
• Multiple databases per instance.
• Only in >= SPS09. High isolation only in >= SPS10.
• Indexserver ports 3NN40, 3NN43, … 3NN99 (NN = instance number).
• Revert not possible, except by export/import.
• Creating tenants:
CREATE DATABASE DB1 SYSTEM USER PASSWORD "Linux123"
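The port patterns above can be worked through with a small sketch. It assumes that tenant indexserver ports advance in steps of 3 starting at 3NN40, which is what the sequence 3NN40, 3NN43 suggests, and that 3NN99 is the upper bound of the range. The function names are hypothetical and exist only for illustration.

```python
def _check_instance(instance_number):
    # NN is a two-digit instance number (assumption: 0..99).
    if not 0 <= instance_number <= 99:
        raise ValueError("instance number must be between 0 and 99")


def single_instance_indexserver_port(instance_number):
    """Return 3NN15, the indexserver port of a single-database instance."""
    _check_instance(instance_number)
    return 30000 + instance_number * 100 + 15


def tenant_indexserver_port(instance_number, tenant_index):
    """Return the indexserver port of the tenant_index-th tenant (0-based).

    Assumption: ports start at 3NN40 and advance by 3 per tenant,
    never exceeding 3NN99.
    """
    _check_instance(instance_number)
    offset = 40 + 3 * tenant_index
    if tenant_index < 0 or offset > 99:
        raise ValueError("tenant index outside the 3NN40..3NN99 range")
    return 30000 + instance_number * 100 + offset


if __name__ == "__main__":
    print(single_instance_indexserver_port(1))
    print([tenant_indexserver_port(1, i) for i in range(3)])
```

For instance number 01 this gives 30115 for a single instance and 30140, 30143, 30146 for the first three tenants.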
SAP HANA System Replication Scenarios
SAP HANA Failover Automation (System Replication)
[Diagram labels: pacemaker; active / active; System Replication; SAP HANA (PROD) primary; SAP HANA (PROD) secondary; vIP; PROD, PROD]
SAP HANA Failover Automation (Cost Optimized Scenario)
[Diagram labels: pacemaker; active / passive; System Replication; SAP HANA (PROD) primary; SAP HANA (PROD) secondary + QA/DEV; vIP; PROD, PROD, QA/DEV]
SAP HANA Failover Automation (Cost Optimized Scenario + Disaster Recovery)
[Diagram labels: Site A; Site B; pacemaker; active / passive; System Replication; SAP HANA (PROD) primary; SAP HANA (PROD) secondary + QA/DEV; vIP; PROD, PROD, QA/DEV; SAP HANA (PROD) DR; System Replication (async)]
Log replication modes
• mode=sync (Full Sync):
  • No data loss.
  • Available in SPS8 and higher.
  • Enabled by global.ini -> [system_replication] -> enable_full_sync OR hdbnsutil -sr_fullsync --enable
• mode=sync:
  • Data loss can occur when a takeover is executed while the secondary system is disconnected.
  • Timeout controlled by global.ini -> [system_replication] -> logshipping_timeout.
• mode=syncmem:
  • Data loss can occur when primary and secondary fail at the same time, and when a takeover is executed.
  • Timeout controlled by global.ini -> [system_replication] -> logshipping_timeout.
• mode=async:
  • The most vulnerable to data loss.
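Both settings named above live in the [system_replication] section of global.ini. The following is a minimal sketch of reading them with Python's standard configparser, assuming the file is plain INI text. The sample values are illustrative only, not recommendations, and replication_settings is a hypothetical helper name.

```python
import configparser

# Illustrative fragment; the values are assumptions, not recommended settings.
SAMPLE_GLOBAL_INI = """
[system_replication]
enable_full_sync = true
logshipping_timeout = 30
"""


def replication_settings(ini_text):
    """Extract the two replication settings discussed above from INI text."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return {
        # Missing section or key falls back to "full sync not enabled".
        "full_sync": parser.getboolean(
            "system_replication", "enable_full_sync", fallback=False
        ),
        # None means the timeout is not set explicitly in this file.
        "logshipping_timeout": parser.getint(
            "system_replication", "logshipping_timeout", fallback=None
        ),
    }


if __name__ == "__main__":
    print(replication_settings(SAMPLE_GLOBAL_INI))
```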
Operation modes for system replication
• delta_datashipping:
  • Establishes a system replication in which a delta data shipping takes place occasionally (by default every 10 minutes), in addition to the continuous log shipping.
  • Shipped redo log is not replayed on the secondary site.
• logreplay:
  • Does not require a delta data shipping.
  • Shipped redo log is continuously replayed on the secondary site.
System replication states
• UNKNOWN:
  • Secondary did not connect to primary since last restart of primary.
• INITIALIZING:
  • Initial data transfer in progress.
• SYNCING:
  • Secondary is syncing again.
• ACTIVE:
  • Initialization or sync with primary is complete.
• ERROR:
  • Error occurred on the connection.
• Monitoring via system table:
hdbsql -U userkey 'select distinct REPLICATION_STATUS from SYS.M_SERVICE_REPLICATION'
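Because the monitoring query returns one status per distinct value, a script consuming it has to collapse several states into one verdict. The sketch below does that under an assumed severity ordering (ACTIVE < SYNCING < INITIALIZING < UNKNOWN < ERROR). The ordering and the function names are assumptions made for illustration, not something HANA defines.

```python
# Assumed ordering from least to most severe.
SEVERITY = ["ACTIVE", "SYNCING", "INITIALIZING", "UNKNOWN", "ERROR"]


def worst_replication_status(statuses):
    """Return the most severe status among the rows returned by the query."""
    cleaned = [s.strip().upper() for s in statuses if s.strip()]
    if not cleaned:
        # No rows at all is treated like a secondary that never connected.
        return "UNKNOWN"
    for status in cleaned:
        if status not in SEVERITY:
            raise ValueError("unexpected replication status: %s" % status)
    return max(cleaned, key=SEVERITY.index)


def replication_in_sync(statuses):
    """True only when every service reports ACTIVE."""
    return worst_replication_status(statuses) == "ACTIVE"


if __name__ == "__main__":
    print(worst_replication_status(["ACTIVE", "SYNCING"]))
```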
Parameters
• SAP HANA parameters should be equal between nodes.
• Starting with SAP HANA SPS 12 it is possible to automatically replicate parameters from the primary to the secondary site by activating the following parameter:
  • global.ini -> [inifile_checker] -> replicate = 'true'
Environment consideration
Configuration details
• Production database: HANA01 / SID: HPA / Instance number: 01
• QA/DEV database: HANA02 / SID: HQA / Instance number: 02
• STONITH: IPMI (production); SBD for labs
• Cluster resources for SAP: ocf:suse:SAPHanaTopology, ocf:suse:SAPHana and ocf:heartbeat:SAPDatabase
• stonith-enabled="true" (set to "false" during configuration)
• no-quorum-policy="ignore"
• stonith-action="poweroff"
• PREFER_SITE_TAKEOVER="false"
  • False: try to restart the service locally
  • True: prefer a takeover to the remote site
• AUTOMATED_REGISTER="false"
  • False: former primary instance should NOT register after DUPLICATE_PRIMARY_TIMEOUT
  • True: former primary instance should register after DUPLICATE_PRIMARY_TIMEOUT
High Availability Configuration
Setup SUSE High Availability
• Install SAP HANA
• Initiate cluster on first node
  • Interfaces using bond
  • Exclusive channel for cluster communication
  • Openais service disabled at boot
  • Configure name resolution in SAP HANA for replication network.
    • global.ini -> system_replication_hostname_resolution
sleha-init
• Change configurations before joining the second node
  • Communication via udpu or multicast
  • Secure channel cluster communication (secauth)
  • Check files to sync via csync2
sleha-join
Procedures for SAP HANA System Replication
• Verify that SAP HANA Host Agent is installed on all nodes:
sapcontrol -nr 01 -function CheckHostAgent
SAPHostAgent Installed
• Create an initial full backup:
hdbsql -u SYSTEM -i 01 "BACKUP DATA USING FILE ('COMPLETE_DATA_BACKUP')"
  • System replication will not work without an initial full backup
  • For multitenant, all tenants should have a full backup
• Set up HDB user store for SAP HANA Host Agent:
hdbsql -u SYSTEM -i 01 "CREATE USER SLEHASYNC PASSWORD Password1"
hdbsql -u SYSTEM -i 01 "GRANT PUBLIC TO SLEHASYNC"
hdbsql -u SYSTEM -i 01 "GRANT MONITORING TO SLEHASYNC"
hdbsql -u SYSTEM -i 01 "ALTER USER SLEHASYNC DISABLE PASSWORD LIFETIME"
hdbsql -u SYSTEM -i 01 "ALTER USER SLEHASYNC SET PARAMETER PRIORITY = '8'"
hdbuserstore SET slehaloc localhost:30113 slehasync Password1
Details about SAP HANA System Replication
• Validate query
• SAP HANA software version of the secondary has to be equal to or newer than the version on the primary
  • Near-zero-downtime procedures rely on this
• Replication information:
hdbsql -U slehaloc "select REPLICATION_STATUS from M_SERVICE_REPLICATION"
hdbcons -e hdbindexserver "replication info"
srTakeover
• Set a script to hook takeover operations and adjust the SAP HANA configuration to fit in memory:
  • When a takeover occurs, the hook is invoked on the secondary:
ALTER SYSTEM ALTER CONFIGURATION ('global.ini','SYSTEM') UNSET ('memorymanager','global_allocation_limit') WITH RECONFIGURE
ALTER SYSTEM ALTER CONFIGURATION ('global.ini','SYSTEM') UNSET ('system_replication','preload_column_tables') WITH RECONFIGURE
• SAP Note 2196941:
  • https://launchpad.support.sap.com/#/notes/2196941/E
• https://wiki.scn.sap.com/wiki/display/ATopics/HOW+TO+SET+UP+SAPHanaSR+IN+THE+COST+OPTIMIZED+SAP+HANA+SR+SCENARIO+-+PART+I
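The two statements issued at takeover differ only in section and key, so a hook can generate them from a list. The sketch below only builds the SQL text. It does not connect to SAP HANA and is not the hook from the SAP Note, whose real form runs inside HANA's HA/DR provider framework. The names PARAMETERS_TO_UNSET, unset_statement and takeover_statements are hypothetical.

```python
# Parameters that limit the secondary while QA/DEV shares the host,
# taken from the two statements shown on the slide.
PARAMETERS_TO_UNSET = [
    ("memorymanager", "global_allocation_limit"),
    ("system_replication", "preload_column_tables"),
]


def unset_statement(section, key, ini_file="global.ini", layer="SYSTEM"):
    """Build one ALTER SYSTEM ... UNSET statement for a configuration key."""
    return (
        "ALTER SYSTEM ALTER CONFIGURATION ('%s','%s') "
        "UNSET ('%s','%s') WITH RECONFIGURE" % (ini_file, layer, section, key)
    )


def takeover_statements(parameters=PARAMETERS_TO_UNSET):
    """Return the statements a takeover hook would run, in order."""
    return [unset_statement(section, key) for section, key in parameters]


if __name__ == "__main__":
    for statement in takeover_statements():
        print(statement)
```

Keeping the parameter list as data makes it easy to review which limits are lifted when PROD takes over the secondary node.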
Fixed Bugs
SAPDatabase Resource Agent (sapdb.sh)
• In Multi-Tenant Databases, SAP HANA starts multiple indexserver daemons, one for each database:
hana01:/usr/sap/hostctrl/exe> saphostctrl -function GetDatabaseStatus -dbname HPA -dbtype hdb 01 slehaloc
Database Status: Warning
Component name: hdbdaemon (HDB Daemon), Status: Running (Running)
Component name: hdbcompileserver (HDB Compileserver), Status: Running (Running)
Component name: hdbnameserver (HDB Nameserver), Status: Running (Running)
Component name: hdbpreprocessor (HDB Preprocessor), Status: Running (Running)
Component name: hdbwebdispatcher (HDB Web Dispatcher), Status: Running (Running)
Component name: hdbindexserver (indexserver-DB1), Status: Running (Running)
Component name: hdbindexserver (indexserver-DB2), Status: Running (Running)
Component name: hdbindexserver (indexserver-DB3), Status: Running (Running)
Component name: hdbconnectivity (HDB Connectivity), Status: Running (connect possible)
Component name: hdbalertmanager (HDB Alertmanager), Status: Warning (alerts on database.)
SAPDatabase Resource Agent (sapdb.sh)
• The resource agent uses a command to get the status of monitored services, as specified in MONITOR_SERVICES
• sapdb.sh was crafted to parse the output based only on the service name and check whether the status is "Running":
hana01:/usr/sap/hostctrl/exe> echo "$output" | grep -i "Component[ ]*Name *[:=] *hdbindexserver (" |
sed 's/^.*Status *[:=] *\([A-Za-z][A-Za-z0-9_]*\).*$/\1/i'
Running
Running
Running
• Now we have N services with the same name, which causes a "Running Running Running ..." status
• Pull request sent to ClusterLabs (#858), merged on Oct 12, 2016
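The failure mode can be reproduced outside the cluster. The sketch below parses a sample of the saphostctrl output shown earlier and contrasts a check that compares the whole captured text with "Running" against one that requires every matching service to be running. It illustrates the problem only and is not the patch from the pull request. The function names are hypothetical.

```python
import re

SAMPLE_OUTPUT = """\
Component name: hdbnameserver (HDB Nameserver), Status: Running (Running)
Component name: hdbindexserver (indexserver-DB1), Status: Running (Running)
Component name: hdbindexserver (indexserver-DB2), Status: Running (Running)
Component name: hdbindexserver (indexserver-DB3), Status: Running (Running)
"""


def service_statuses(output, service):
    """Return the status word of every component line matching the service."""
    pattern = re.compile(
        r"Component\s*name\s*[:=]\s*" + re.escape(service)
        + r"\s*\(.*?Status\s*[:=]\s*([A-Za-z][A-Za-z0-9_]*)",
        re.IGNORECASE,
    )
    statuses = []
    for line in output.splitlines():
        match = pattern.search(line)
        if match:
            statuses.append(match.group(1))
    return statuses


def naive_is_running(output, service):
    # Mimics comparing the whole captured text with the single word "Running".
    return " ".join(service_statuses(output, service)) == "Running"


def all_running(output, service):
    # Treats the service as healthy only if every instance reports Running.
    statuses = service_statuses(output, service)
    return bool(statuses) and all(s == "Running" for s in statuses)


if __name__ == "__main__":
    print(naive_is_running(SAMPLE_OUTPUT, "hdbindexserver"))
    print(all_running(SAMPLE_OUTPUT, "hdbindexserver"))
```

With three tenants the naive comparison sees "Running Running Running" and reports a failure even though every indexserver is up.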
Troubleshooting
Stress testing
• HANA Stress Tool
  • https://github.com/Centiq/HanaStress
  • ./hanastress.py -v --host localhost -i 01 -u SLEHASYNC \
      -p Password1 -g anarchy --tables 200 \
      --rows 1000000 --threads 20
    (This will create 200 tables with 1000000 rows of information each, using 20 threads)
• After stress:
  - Remove database fragmentation:
ALTER SYSTEM RECLAIM DATAVOLUME 120 DEFRAGMENT
ALTER SYSTEM RECLAIM LOG
  - Force flushing log data to disk:
ALTER SYSTEM SAVEPOINT
Detecting failures
• /var/log/messages
  • Inspect this file to see cluster actions related to SAP HANA operations
  • All environment messages are logged there
  • HanaSR consists of editable scripts: /usr/lib/ocf/
  • You can increase the verbosity of bash scripts by adding "-x" to the shebang
• hb_report
  • hb_report -u root -f "2016/02/13 08:45" -t "2016/02/13 10:45" /tmp/incident
• Monitor SAP HANA logs (look for *.crashdump files)
  • When SAP HANA has problems running, or one of its services hangs, it writes a crashdump
  • The files are inside the SAP HANA instance files folder
Failover SAP HANA
Operations on primary node
• Stop primary node for maintenance:
  • It is usually better to perform a graceful shutdown of OpenAIS/Pacemaker:
rcopenais stop
• After a failure or when returning from maintenance:
  • It's better to rely on an ad hoc sync first (HanaSR operations can take a long time to bring the primary to SOK status and take PROD back; this also avoids split-brain)
  • Clean up all replication status (primary node):
hdbnsutil -sr_cleanup --force
Operations on primary node
• Reestablishing synchronization:
  • Register primary as slave node:
hdbnsutil -sr_register --remoteHost=<hostname> \
    --remoteInstance=<instance_number> --mode=syncmem --name=<SITENAME>
  • Start HANA, wait until sync, then stop it
  • Start OpenAIS
    • It will start HANA and replication, and after a while it will migrate HANA Production to the primary node
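Since the same registration command is typed by hand during every recovery, a small builder that validates its arguments can prevent typos. This is a sketch only: it assembles the argument list using the flags shown on the slides and never runs hdbnsutil. The operation mode names follow SAP's documented spelling (delta_datashipping, logreplay), the remote instance value is passed through unchanged, and the host and site names in the example are hypothetical.

```python
REPLICATION_MODES = {"sync", "syncmem", "async"}
OPERATION_MODES = {"delta_datashipping", "logreplay"}


def sr_register_command(remote_host, remote_instance, mode, site_name,
                        operation_mode=None):
    """Build the argv list for 'hdbnsutil -sr_register' without executing it."""
    if mode not in REPLICATION_MODES:
        raise ValueError("unknown replication mode: %s" % mode)
    if operation_mode is not None and operation_mode not in OPERATION_MODES:
        raise ValueError("unknown operation mode: %s" % operation_mode)
    argv = [
        "hdbnsutil",
        "-sr_register",
        "--remoteHost=%s" % remote_host,
        "--remoteInstance=%s" % remote_instance,
        "--mode=%s" % mode,
    ]
    if operation_mode is not None:
        argv.append("--operationMode=%s" % operation_mode)
    argv.append("--name=%s" % site_name)
    return argv


if __name__ == "__main__":
    # "hana02" and "SITEA" are made-up example values.
    print(" ".join(sr_register_command("hana02", "01", "syncmem", "SITEA")))
```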
Operations on second node
• Simple, because no resource reallocation is needed.
• Stopping and starting openais should be fine.
DR (Disaster Recovery) considerations
Operations DR Node
• Enable secondary node as sync source:
hdbnsutil -sr_enable
• Start DR replication:
hdbnsutil -sr_register --remoteHost=<remote_hostname> \
    --remoteInstance=<instance_number> --mode=async \
    --operationMode=<delta_datashipping|logreplay> --name=<DR_NAME>
• Stop DR replication:
hdbnsutil -sr_unregister --name=<DR_NAME>
• Avoid shutting down the DR node without unregistering it:
  • HanaSR uses HANA commands to monitor the system replication
  • When a node is unreachable, these commands can take more than one minute to return.
  • With this latency, some cluster operations can time out.
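The latency problem described above is easy to observe when testing by hand: wrap the status command in a deadline and see whether it returns in time. The sketch below uses Python's subprocess timeout for that purpose. It is a diagnostic aid under the assumption that the command is safe to kill, not part of HanaSR, and run_with_deadline is a hypothetical helper name.

```python
import subprocess
import sys


def run_with_deadline(argv, timeout_seconds):
    """Run a command and report whether it finished within the deadline.

    Returns (finished, stdout). On timeout the child is killed and
    (False, "") is returned.
    """
    try:
        completed = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        return False, ""
    return True, completed.stdout


if __name__ == "__main__":
    # Stand-in command; replace with the real status query when testing.
    finished, output = run_with_deadline(
        [sys.executable, "-c", "print('status ok')"], 10
    )
    print(finished, output.strip())
```

Comparing the measured duration with the cluster's operation timeouts shows whether an unreachable DR node would push monitoring past its limit.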
References
Related SUSECon Presentations
• CAS89126 - SAP HANA High Availability with SUSE HA: Tales of clustering from the real world
• FUT92716 - SUSE Linux Enterprise Server for SAP Applications Roadmap
• SPO98283 - Scaling Your SAP HANA Data Warehouse with Lenovo and SUSE
• TUT89539 - SUSE High Availability for SAP HANA TDI in a VMware Environment
• TUT90846 - Towards Zero Downtime - How to Maintain SAP HANA System Replication Clusters
• TUT91496 - Live Patching Demo: Keep SAP Running When Patching the Linux Kernel
SAP Notes
• SAP Note 2165547 - FAQ: SAP HANA Database Backup & Recovery in an SAP HANA System Replication Landscape
• SAP Note 1999880 - FAQ: SAP HANA System Replication
• SAP Note 1702224 - Disable password lifetime for technical users
• SAP Note 2222250 - FAQ: SAP HANA Workload Management
References
• SAP HANA Administration Guide: http://help.sap.com/hana/SAP_HANA_Administration_Guide_en.pdf
• http://help.sap.com/hana/SAP_HANA_Server_Installation_Guide_en.pdf
• https://help.sap.com/saphelp_hanaplatform/helpdata/en/54/01f498b2c84fb5b3bcdcbda948d991/content.htm
• https://help.sap.com/saphelp_hanaplatform/helpdata/en/74/418e86b48542ffb38b54072e0b66ce/content.htm
• https://www.suse.com/docrep/documents/19rp0i23ol/sap_hana_sr_performance_optimized_scenario_11_sp4.pdf
Questions?
Thank you!