Post on 14-Sep-2020

© COPYRIGHT 2016 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.

Silvano Ravotto and Marc Chiarini, MarkLogic Performance Engineering

JUMPSTART: MONITORING AND TUNING MARKLOGIC

SLIDE: 2

Agenda

• Introduction

• Methodologies

• Monitoring

• Tools overview

• Platform specific tools

• Tuning

INTRODUCTION

SLIDE: 4

Going Back in Time

SLIDE: 5

Picture = 34,140 words

SLIDE: 6

Application

Looks Familiar?

OS

Virtualization

CPU Disk Memory Network

Third Party Software

SLIDE: 7

“Or se' tu quel Virgilio?” (“Are you, then, that Virgil?”)

http://www.brendangregg.com/

Great resource on the world of systems performance

“Systems Performance: Enterprise and the Cloud” (Book)

METHODOLOGIES

SLIDE: 9

Methodologies

Procedures to analyze system or application performance

Provide a starting point

Different methodologies are suited for solving different classes of issues

You may need to try more than one before you reach your goal

SLIDE: 10

Some Good Methodologies

Whys Method

Drill-Down Analysis Method

USE Method ‒ For every resource, check:

Utilization

Saturation

Errors
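As a minimal sketch of the USE method (not part of the original slides), the checklist can be generated by crossing every resource with the three checks. The resource list is borrowed from the layered diagram earlier in the deck and is an assumption about which resources matter on a given system; the name `use_checklist` is hypothetical.

```python
from itertools import product

# Resources taken from the earlier "CPU Disk Memory Network" diagram.
# This is an assumption; extend the list for your own system.
RESOURCES = ["CPU", "Disk", "Memory", "Network"]

# The three USE checks named on the slide.
CHECKS = ["Utilization", "Saturation", "Errors"]


def use_checklist(resources=RESOURCES, checks=CHECKS):
    """Return one (resource, check) pair for every combination."""
    return list(product(resources, checks))


if __name__ == "__main__":
    for resource, check in use_checklist():
        print(f"[ ] {resource}: {check}")
```

Working through the list row by row keeps any one resource from being skipped because another looks like the obvious culprit.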

OBSERVABILITY TOOLS

SLIDE: 12

Observability Tools

In order to characterize performance, you need to observe it.

(Trust Nothing)

SLIDE: 13

Observability Tools

Application Tools

MarkLogic Monitoring History

Infrastructure Tools

Amazon CloudWatch, Performance Co-Pilot, Nagios, NewRelic

OS Tools

Linux, Windows

SLIDE: 14

MarkLogic Monitoring History (Application)

SLIDE: 15

MarkLogic Monitoring History

MarkLogic server view

Covers several resources

Disk, CPU, Memory, Network, Server, Database

Can’t be used alone

MarkLogic process overloaded => incomplete picture

SLIDE: 16

Amazon CloudWatch (Infrastructure)

SLIDE: 17

Amazon CloudWatch

Collects and tracks metrics and log files

Set alarms

Monitors AWS resources

Amazon EC2 instances, EBS, ELB

Supports custom metrics generated by applications and services

SLIDE: 18

Performance Co-Pilot (Infrastructure)

PCP is a framework for system-level performance analysis

Collection, monitoring, and analysis of system metrics

Available on Linux, Windows, Mac OS

Easily extendable and flexible

MarkLogic PMDA - https://github.com/sravotto/marklogic-pcp-pmda

SLIDE: 19

Sample PCP Commands

pmstat

High level metrics (like vmstat)

pminfo

List all the known metrics

pmdumptext

Dump performance metrics in ASCII table

SLIDE: 20

Sample PCP Graphical Tool

LINUX TOOLS

SLIDE: 22

Linux Performance Observability Tools

SLIDE: 23

Perf (Linux Kernel 2.6+)

perf top

Generates and displays a performance counter profile in real time

perf record -F 99 -ag -- sleep 60

Record overall activity for 60 seconds

perf report --stdio

Display the activity

SLIDE: 24

Perf Report Example

80.54%  0.23%  MarkLogic  MarkLogic  [.] xdmp::LetClause::flworEval
        |
        ---xdmp::LetClause::flworEval
           |
           |--95.87%-- xdmp::LetClause::flworEval
           |          |
           |          |--97.76%-- xdmp::LetClause::flworEval
           |          |          |
           |          |          |--99.79%-- xdmp::LetClause::flworEval
           |          |          |          |
           |          |          |          |--93.91%-- xdmp::FLWORExpr::eval
           |          |          |          |          xdmp::IfExpr::eval
           |          |          |          |          xdmp::EvalFLWOREnv::evalReturn
           |          |          |          |          xdmp::LetClause::flworEval
           |          |          |          |          xdmp::FLWORExpr::eval
           |          |          |          |          xdmp::FunctionCall::eval
           |          |          |          |          xdmp::WithNamespacesExpr::eval

SLIDE: 25

Sample Flame Graph (Ingestion)

[Figure: flame graph of ingestion; labeled regions: Saving, In-Memory Stand, Merges, Parsing + Indexing]

CASE STUDY

SLIDE: 27

Customer Case Study

Main Use Cases

Raw ingest

Transformations of content

Extraction

Problem Statement

Sharp performance degradation of ingestion with increasing database size

SLA no longer met

# sar -u

12:00:01 AM  CPU   %usr  %nice   %sys  %iowait  %steal   %irq  %soft  %guest  %idle
04:33:01 PM  all   8.01   0.06   3.00    31.91    0.00   0.00   0.20    0.00  56.83
04:34:01 PM  all   4.40   0.06   1.83    44.48    0.00   0.00   0.14    0.00  49.10
04:35:01 PM  all   4.68   0.05   1.89    39.08    0.00   0.00   0.15    0.00  54.15

CPU Metrics

CPU is mostly waiting for I/O. Why?
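To make the observation concrete, here is a minimal sketch that parses the three sample rows of the sar -u output above and compares mean %iowait with the time spent doing useful work (%usr + %sys). The function names are hypothetical, and the parser assumes the 12-field row layout shown in this sample rather than every sar output format.

```python
# Sample rows copied from the sar -u output in this case study.
SAR_CPU = """\
04:33:01 PM all 8.01 0.06 3.00 31.91 0.00 0.00 0.20 0.00 56.83
04:34:01 PM all 4.40 0.06 1.83 44.48 0.00 0.00 0.14 0.00 49.10
04:35:01 PM all 4.68 0.05 1.89 39.08 0.00 0.00 0.15 0.00 54.15
"""

# Column order from the sar header, after the time, AM/PM and CPU fields.
COLUMNS = ["usr", "nice", "sys", "iowait", "steal", "irq", "soft", "guest", "idle"]


def parse_sar_cpu(text):
    """Turn sar -u data rows into dicts of floats; skip headers and odd lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3 + len(COLUMNS) or fields[2] == "CPU":
            continue
        rows.append(dict(zip(COLUMNS, (float(v) for v in fields[3:]))))
    return rows


def mean(rows, column):
    """Arithmetic mean of one column across all parsed rows."""
    return sum(row[column] for row in rows) / len(rows)


if __name__ == "__main__":
    rows = parse_sar_cpu(SAR_CPU)
    busy = mean(rows, "usr") + mean(rows, "sys")
    print(f"mean %iowait:     {mean(rows, 'iowait'):.2f}")
    print(f"mean %usr + %sys: {busy:.2f}")
```

In this sample, the CPUs spend several times longer waiting on I/O than executing user or system code, which is what prompts the next question.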

# sar -d

12:00:01 AM  DEV         tps   rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await  svctm   %util
04:34:01 PM  dev8-0  4984.60  120171.70   7192.76     25.55    117.69  23.61   0.21  104.57
04:35:01 PM  dev8-0  4215.76  126061.58   6838.80     31.52     96.93  22.99   0.22   91.28
04:36:01 PM  dev8-0  4126.01  135221.99   6944.88     34.46     95.97  23.28   0.22   90.15
04:37:02 PM  dev8-0  4135.01  137442.28   7359.64     35.02     96.92  23.44   0.22   89.76

I/O Metrics

Reads are dominating writes during ingest. Why?
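The imbalance can be quantified with a short sketch over the sar -d sample rows above. It computes the read-to-write ratio per interval and converts the read rate to MB/s, using the 512-byte sector size that sar documents for rd_sec/s. The function name is hypothetical and the parser assumes the column order of this sample.

```python
# Sample rows copied from the sar -d output in this case study.
SAR_DISK = """\
04:34:01 PM dev8-0 4984.60 120171.70 7192.76 25.55 117.69 23.61 0.21 104.57
04:35:01 PM dev8-0 4215.76 126061.58 6838.80 31.52 96.93 22.99 0.22 91.28
04:36:01 PM dev8-0 4126.01 135221.99 6944.88 34.46 95.97 23.28 0.22 90.15
04:37:02 PM dev8-0 4135.01 137442.28 7359.64 35.02 96.92 23.44 0.22 89.76
"""

SECTOR_BYTES = 512  # sar reports rd_sec/s and wr_sec/s in 512-byte sectors


def read_write_stats(text):
    """Return (time, read MB/s, read/write ratio) for each data row."""
    stats = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 11 or fields[2] == "DEV":
            continue  # skip headers and anything that is not a data row
        rd_sec, wr_sec = float(fields[4]), float(fields[5])
        read_mb_s = rd_sec * SECTOR_BYTES / 1_000_000
        stats.append((fields[0], read_mb_s, rd_sec / wr_sec))
    return stats


if __name__ == "__main__":
    for time, read_mb_s, ratio in read_write_stats(SAR_DISK):
        print(f"{time}  reads {read_mb_s:6.1f} MB/s  read/write ratio {ratio:5.1f}")
```

For an ingest-heavy workload one would expect writes to dominate, so a ratio this far above 1 is the anomaly worth chasing.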

Thread A

pread64()

svc::StandardRandomReadResult (…)

svc::StandardRandomReadTask::run()

svc::PooledThread::run()

Execution Stack Analysis

Execution stack (pstack) shows threads waiting on random reads.

Threads waiting on I/O originate from cts:uris. Why?

Thread B

sem_wait

svc::Semaphore::wait(bool)

svc::StandardRandomReadResultFuture::_finish()

xdmp::cts_uris(…)

let $uri := cts:uris((), (), cts:element-value-query(xs:QName("ID"), $id))
return
  if ($uri)
  then update (...)
  else insert (...)

Code Inspection

We found cts:uris calls had an element value query with a document ID.

Why is this impacting performance?

SLIDE: 32

ID Lookup Using the Universal Index

HASH(ID) = A34E0

ID

List Index (One for each Stand)

A34DE-A34EA A34E0 ?

MEMORY

List Data (One for each Stand)

DISK (Term List)

Worst Case Disk Reads (Node) = Ingestion Rate * Number of Stands

SLIDE: 33

Tuning Advice

Avoid I/O to get the term list

With 20 stands per node and 500 documents per second (DPS): 20 × 500 = 10,000 read operations per second!

Leverage a range index for these unique IDs

All the values are in memory

Change cts:element-value-query to cts:element-range-query
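The worst-case formula from the previous slide can be checked with a few lines of arithmetic. This is only a sketch of that formula; the function name is hypothetical, and stand counts other than 20 are illustrative values, not measurements from the case study.

```python
def worst_case_reads_per_second(ingest_rate_dps, stands_per_node):
    """Worst case: every ID lookup reads one term list from disk per stand."""
    return ingest_rate_dps * stands_per_node


if __name__ == "__main__":
    rate = 500  # documents per second, from the slide
    for stands in (5, 10, 20, 40):  # only 20 comes from the slide
        reads = worst_case_reads_per_second(rate, stands)
        print(f"{stands:2d} stands/node -> {reads:6d} reads/s")
```

Because the cost scales with the number of stands, the lookup gets more expensive as the database grows, which matches the degradation described in the problem statement.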

SLIDE: 34

Case Study Recap

Analyze each layer

Ask yourself why you see a particular behavior

Collect additional metrics as you drill down

Are the metrics telling you a different story?

Make changes once you understand the issue

Only one change at a time

WINDOWS TOOLS

SLIDE: 36

Windows Performance Analysis Tools

Primarily GUI-based but can export important data

Progressively more powerful and detailed:

Procexp (Process Explorer) – task manager

Procmon (Process Monitor) – record system calls and thread activity

WPA (Windows Performance Analyzer) – fine-grained detail on CPU, memory, disk, network activity, etc.

SLIDE: 37

Procexp (Process Explorer)

Comprehensive replacement for Windows Task Manager

Gives a sense of performance / problems in the moment.

Available at https://technet.microsoft.com/en-us/sysinternals/processexplorer

SLIDE: 38

Procexp Main Window

SLIDE: 39

MarkLogic Resources

SLIDE: 40

Performance Stats

SLIDE: 41

Disk & Network Stats

SLIDE: 42

Detailed Resources

SLIDE: 43

SLIDE: 44

SLIDE: 45

Procmon (Process Monitor)

Records and filters Windows system call statistics

Can also do thread profiling

Useful for finding abnormal conditions/errors that can cause performance problems

Available at https://technet.microsoft.com/en-us/sysinternals/processmonitor

SLIDE: 46

Procmon Main Window

Start / Stop Capture

Start / Stop Autoscroll

Clear Log

Show Filesystem Events

Show Proc/Thread Events

SLIDE: 47

Resources Used by Visible Events

SLIDE: 48

Occurrences of Column Values

SLIDE: 49

Filesystem Activity Summary

SLIDE: 50

Procmon Workflow

Capture a short trace

Determine event classes of interest

Capture a longer trace

Filter out uninteresting events such as successful system calls

Search for unusually high or low event counts

Search for unusual errors

Send stack traces of unexpected activity to ML support
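The filtering and counting steps of this workflow can also be done offline on an exported trace. The sketch below assumes a Procmon CSV export with "Operation" and "Result" columns; the column names, the sample rows, and the function name are all assumptions made for illustration and should be checked against a real export.

```python
import csv
import io
from collections import Counter

# Hypothetical sample shaped like a Procmon CSV export. The column names are
# an assumption and every row is invented for illustration.
SAMPLE = r'''"Time of Day","Process Name","PID","Operation","Path","Result","Detail"
"10:00:00.0000001","MarkLogic.exe","4242","ReadFile","D:\Forests\f1\TreeData","SUCCESS",""
"10:00:00.0000002","MarkLogic.exe","4242","CreateFile","D:\Forests\f1\Missing","NAME NOT FOUND",""
"10:00:00.0000003","MarkLogic.exe","4242","CreateFile","D:\Forests\f1\Missing","NAME NOT FOUND",""
"10:00:00.0000004","MarkLogic.exe","4242","CreateFile","D:\Forests\f1\Label","SHARING VIOLATION",""
"10:00:00.0000005","MarkLogic.exe","4242","WriteFile","D:\Forests\f1\Journal","SUCCESS",""
'''


def failed_event_counts(csv_text):
    """Count (operation, result) pairs, ignoring successful calls."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Result"] != "SUCCESS":
            counts[(row["Operation"], row["Result"])] += 1
    return counts


if __name__ == "__main__":
    for (operation, result), n in failed_event_counts(SAMPLE).most_common():
        print(f"{n:4d}  {operation:12s} {result}")
```

Sorting by count surfaces unusually frequent failures first, and the rare results at the bottom of the list are the candidates for "unusual errors".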

SLIDE: 51

Procmon Workflow

SLIDE: 52

WPA (Windows Performance Analyzer)

Comprehensive, detailed tracing of all activities

On-screen highlighting and correlation of concurrent events

Available as part of Windows ADK at http://go.microsoft.com/fwlink/p/?LinkId=526740

Also be sure to check out Bruce Dawson’s UIforETW available at: https://randomascii.wordpress.com/2015/09/24/etw-central/

SLIDE: 53

WPA Main Window

SLIDE: 54

Windows Tools Summary

Similar to Linux, there is a set of Windows tools that can be used to observe and analyze MarkLogic performance at any level

Procexp is good for instantaneous observation

Procmon is good for finding unexpected behaviors in Windows system calls

WPA is good for correlating resource usage across MarkLogic functions

TUNING

SLIDE: 56

Tuning Targets

Target            Tunables
Application       Queries
MarkLogic Server  Indexing, Cache, MVCC settings
Operating System  Huge Pages, Virtual Memory settings
Storage           RAID Level, I/O Schedulers

SLIDE: 57

Tuning – Application Target

Queries: Query Meters, Query Plan

Code: Profiler API

For more details, see the “Performance Tuning for Developers” session

SLIDE: 58

Tuning – Server Target

Caches

Bigger caches → Less I/O

More partitions → May reduce contention

Indexes

Trade-off: Disk space vs performance

Multi-Version Concurrency Control (MVCC)

Contemporaneous vs nonblocking

SLIDE: 59

Tuning – OS Target

Huge Pages

Turn off Transparent Huge Pages (THP)

Virtual Memory

Reduce swappiness

Turn off reclaim mode (caching effect more important than locality)
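Before changing any of these settings it helps to see their current values. The sketch below reads them on Linux. It assumes the mainline kernel paths (some older distributions expose THP under a different directory) and it interprets "reclaim mode" as the vm.zone_reclaim_mode sysctl, which is an assumption. It only reports values; the slides give no target numbers, so it does not judge them.

```python
from pathlib import Path

# Mainline Linux locations; these paths are assumptions and may differ on
# older or vendor-patched kernels.
SETTINGS = {
    "transparent_hugepage": "/sys/kernel/mm/transparent_hugepage/enabled",
    "swappiness": "/proc/sys/vm/swappiness",
    "zone_reclaim_mode": "/proc/sys/vm/zone_reclaim_mode",
}


def active_choice(text):
    """Return the bracketed choice from text like 'always madvise [never]'.

    Plain values such as '60' are returned stripped and unchanged.
    """
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return text.strip()


def read_setting(path):
    """Return the current value at path, or None if the file is missing."""
    p = Path(path)
    return active_choice(p.read_text()) if p.is_file() else None


if __name__ == "__main__":
    for name, path in SETTINGS.items():
        value = read_setting(path)
        print(f"{name:22s} {value if value is not None else 'not available'}")
```

Recording the values before and after each change also supports the earlier advice to change only one thing at a time.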

SLIDE: 60

Tuning – Storage Target

RAID 10 for best performance

RAID 5 for capacity

Use noop or deadline I/O schedulers

Leverage Tiered Storage
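A quick way to audit the I/O scheduler advice is to list the active scheduler of every block device. This sketch assumes the usual Linux sysfs layout, /sys/block/<device>/queue/scheduler, with the active scheduler shown in brackets. Note that newer multi-queue kernels name the comparable schedulers "none" and "mq-deadline", so the recommended set below, taken from the slide, may need adjusting; the function names are hypothetical.

```python
from pathlib import Path

RECOMMENDED = {"noop", "deadline"}  # scheduler names from the slide


def active_scheduler(text):
    """Return the bracketed scheduler from text like 'noop [deadline] cfq'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def schedulers(sys_block="/sys/block"):
    """Map each block device name to its active scheduler (or None)."""
    result = {}
    root = Path(sys_block)
    if not root.is_dir():
        return result
    for device in sorted(root.iterdir()):
        scheduler_file = device / "queue" / "scheduler"
        if scheduler_file.is_file():
            result[device.name] = active_scheduler(scheduler_file.read_text())
    return result


if __name__ == "__main__":
    for name, scheduler in schedulers().items():
        note = "" if scheduler in RECOMMENDED else "  <- review"
        print(f"{name:10s} {scheduler}{note}")
```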

SUMMARY

SLIDE: 62

Summary

Follow a methodology:

Understand the issue reported

Decide on what data/metrics to collect

Decide how to analyze it

Choose the appropriate tools:

Don’t trust a single tool

Gather metrics incrementally using multiple tools

Tune your targets according to MarkLogic guidelines

Q&A