Troubleshooting common oslo.messaging and RabbitMQ issues

Date post: 16-Apr-2017
Upload: michael-klishin
Identifying (and fixing) oslo.messaging & RabbitMQ issues Michael Klishin, Pivotal Dmitry Mescheryakov, Mirantis
Transcript
Page 1: Troubleshooting common oslo.messaging and RabbitMQ issues

Identifying (and fixing) oslo.messaging & RabbitMQ issues

Michael Klishin, Pivotal

Dmitry Mescheryakov, Mirantis

Page 5: Troubleshooting common oslo.messaging and RabbitMQ issues

What is oslo.messaging?

● Library for

○ building RPC clients/servers

○ emitting/handling notifications

● Supports several backends:

○ RabbitMQ

■ based on Kombu - the oldest and best-known driver, and the one this talk covers

■ based on Pika - recent addition

○ AMQP 1.0

○ ZeroMQ

○ Kafka

● Developed entirely under the OpenStack umbrella

● Used by many OpenStack services, mostly for internal communication

Page 6: Troubleshooting common oslo.messaging and RabbitMQ issues

Spawning a VM in Nova

[Diagram: a client reaches nova-api over HTTP; the nova-api, nova-conductor, nova-scheduler, and nova-compute instances are linked by RPC.]

Page 8: Troubleshooting common oslo.messaging and RabbitMQ issues

Examples

Internal:

● nova-compute sends a report to nova-conductor every minute

● nova-conductor sends a command to spawn a VM to nova-compute

● neutron-l3-agent requests router list from neutron-server

● …

External:

● Every OpenStack service sends notifications to Ceilometer

Page 9: Troubleshooting common oslo.messaging and RabbitMQ issues

Where is RabbitMQ in this picture?

[Diagram: an RPC between nova-conductor and nova-compute goes through RabbitMQ. The request is labeled with the queue compute.node-1.domain.tld and the response with the queue reply_b6686f7be58b4773a2e0f5475368d19a.]

Page 11: Troubleshooting common oslo.messaging and RabbitMQ issues

Spotting oslo.messaging logs

2016-04-15 11:16:57.239 16181 DEBUG nova.service [req-d83ae554-7ef5-4299-82ce-3f70b00b6490 - - - - -] Creating RPC server for service scheduler start /usr/lib/python2.7/dist-packages/nova/service.py:218

2016-04-15 11:16:57.258 16181 DEBUG oslo.messaging._drivers.pool [req-d83ae554-7ef5-4299-82ce-3f70b00b6490 - - - - -] Pool creating new connection create /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/pool.py:109

Page 12: Troubleshooting common oslo.messaging and RabbitMQ issues

...

File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 420, in _send

result = self._waiter.wait(msg_id, timeout)

File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 318, in wait

message = self.waiters.get(msg_id, timeout=timeout)

File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 223, in get

'to message ID %s' % msg_id)

MessagingTimeout: Timed out waiting for a reply to message ID 9e4a677887134a0cbc134649cd46d1ce

My favorite oslo.messaging exception

Page 13: Troubleshooting common oslo.messaging and RabbitMQ issues

oslo.messaging operations

● Cast - fire an RPC request and forget about it

● Notify - the same as cast, only the message format differs

● Call - send an RPC request and wait for the reply

Call raises a MessagingTimeout exception when no reply arrives within the timeout (`rpc_response_timeout`, 60 seconds by default)

Page 14: Troubleshooting common oslo.messaging and RabbitMQ issues

Making a Call

1. Client -> request -> RabbitMQ

2. RabbitMQ -> request -> Server

3. Server processes the request and produces the response

4. Server -> response -> RabbitMQ

5. RabbitMQ -> response -> Client

If the process gets stuck at any step from 2 to 5, the client gets a MessagingTimeout exception.
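
The five steps, and the timeout that follows when any of steps 2 to 5 stalls, can be imitated with standard-library queues. This is only a toy model, not oslo.messaging code: the `call` and `server` helpers, the `MessagingTimeout` class defined here, and the in-memory queues are hypothetical stand-ins for the real client, server, and RabbitMQ.

```python
import queue
import threading
import uuid


class MessagingTimeout(Exception):
    """Local stand-in for the oslo.messaging exception of the same name."""


def server(request_q, stop):
    """Steps 2-4: take a request, process it, publish the response."""
    while not stop.is_set():
        try:
            msg_id, reply_q, payload = request_q.get(timeout=0.05)
        except queue.Empty:
            continue
        reply_q.put((msg_id, payload.upper()))


def call(request_q, payload, timeout):
    """Steps 1 and 5: publish a request, then wait for the matching reply."""
    msg_id = uuid.uuid4().hex
    reply_q = queue.Queue()  # plays the role of the reply_<id> queue
    request_q.put((msg_id, reply_q, payload))
    try:
        reply_id, result = reply_q.get(timeout=timeout)
    except queue.Empty:
        raise MessagingTimeout(
            "Timed out waiting for a reply to message ID %s" % msg_id)
    assert reply_id == msg_id
    return result


def demo():
    # Healthy case: a server consumes the request queue.
    requests = queue.Queue()
    stop = threading.Event()
    worker = threading.Thread(target=server, args=(requests, stop))
    worker.start()
    ok = call(requests, "ping", timeout=1.0)
    stop.set()
    worker.join()

    # Stuck case: nobody consumes the queue, so the call times out.
    orphan = queue.Queue()
    try:
        call(orphan, "ping", timeout=0.1)
        timed_out = False
    except MessagingTimeout:
        timed_out = True
    return ok, timed_out


if __name__ == "__main__":
    print(demo())
```

The client cannot tell which step stalled: a missing server, a slow server, and a lost reply all look the same from its side.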

Page 16: Troubleshooting common oslo.messaging and RabbitMQ issues

Debug shows the truth

L3 Agent log

CALL msg_id: ae63b165611f439098f1461f906270de exchange: neutron topic: q-reports-plugin

received reply msg_id: ae63b165611f439098f1461f906270de

Neutron Server

received message msg_id: ae63b165611f439098f1461f906270de reply to: reply_df2405440ffb40969a2f52c769f72e30

REPLY msg_id: ae63b165611f439098f1461f906270de reply queue: reply_df2405440ffb40969a2f52c769f72e30

* Examples from Mitaka
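
With debug logging on, a timed-out call can be followed by its msg_id. As a rough sketch, the script below pairs `CALL msg_id:` lines with `received reply msg_id:` lines and returns the calls that never got a reply. The regular expressions are an assumption based only on the Mitaka fragments shown here, and other releases may word these lines differently. The second msg_id in the sample is made up.

```python
import re

# Patterns modelled on the log fragments above (assumed, not exhaustive).
CALL_RE = re.compile(r"CALL msg_id: (?P<id>[0-9a-f]{32})")
REPLY_RE = re.compile(r"received reply msg_id: (?P<id>[0-9a-f]{32})")


def unanswered_calls(lines):
    """Return msg_ids that were sent with CALL but never got a reply."""
    pending = []
    for line in lines:
        call = CALL_RE.search(line)
        if call:
            pending.append(call.group("id"))
            continue
        reply = REPLY_RE.search(line)
        if reply and reply.group("id") in pending:
            pending.remove(reply.group("id"))
    return pending


SAMPLE_LOG = [
    "CALL msg_id: ae63b165611f439098f1461f906270de exchange: neutron "
    "topic: q-reports-plugin",
    # Hypothetical second call that is never answered.
    "CALL msg_id: 0123456789abcdef0123456789abcdef exchange: neutron "
    "topic: q-reports-plugin",
    "received reply msg_id: ae63b165611f439098f1461f906270de",
]

if __name__ == "__main__":
    print(unanswered_calls(SAMPLE_LOG))
```

Any msg_id it returns can then be searched for in the server-side log to see whether the request arrived at all.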

Page 18: Troubleshooting common oslo.messaging and RabbitMQ issues

Enabling the debug

[DEFAULT]

debug=true

default_log_levels=...,oslo.messaging=DEBUG,...

Page 20: Troubleshooting common oslo.messaging and RabbitMQ issues

If you don’t have debug enabled

Examine the stack trace

Find which operation failed

Guess the destination service

Try to find correlating log entries around the time the request was made

File "/opt/stack/neutron/neutron/agent/dhcp/agent.py", line 571, in _report_state

self.state_rpc.report_state(ctx, self.agent_state, self.use_call)

File "/opt/stack/neutron/neutron/agent/rpc.py", line 86, in report_state

return method(context, 'report_state', **kwargs)

File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call

retry=self.retry)

Page 21: Troubleshooting common oslo.messaging and RabbitMQ issues

Diagnosing issues through RabbitMQ

● # rabbitmqctl list_queues consumers name

A consumer count of 0 means that nothing is listening on the queue

● # rabbitmqctl list_queues messages consumers name

If a queue has consumers but messages keep accumulating in it, the corresponding service cannot process messages in time, is stuck in a deadlock, or the cluster is partitioned
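
Both checks are easy to script. A minimal sketch follows. It assumes the tab-separated `messages consumers name` column order of the second command, and the `backlog_threshold` parameter and the sample rows are made up for illustration.

```python
def suspicious_queues(output, backlog_threshold=100):
    """Classify queues from `rabbitmqctl list_queues messages consumers name`.

    Returns a (no_consumers, backlogged) pair of queue-name lists.
    """
    no_consumers, backlogged = [], []
    for line in output.splitlines():
        fields = line.split("\t")
        # Skip banner lines such as "Listing queues ..." that are not rows.
        if len(fields) != 3:
            continue
        if not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        messages, consumers, name = int(fields[0]), int(fields[1]), fields[2]
        if consumers == 0:
            no_consumers.append(name)
        elif messages >= backlog_threshold:
            backlogged.append(name)
    return no_consumers, backlogged


# Illustrative output, not captured from a real cluster.
SAMPLE = (
    "Listing queues ...\n"
    "0\t1\tscheduler\n"
    "250\t1\tcompute.node-1.domain.tld\n"
    "3\t0\treply_b6686f7be58b4773a2e0f5475368d19a\n"
)

if __name__ == "__main__":
    print(suspicious_queues(SAMPLE))
```

For live data, the same function can be fed the text returned by `subprocess.check_output(["rabbitmqctl", "list_queues", "messages", "consumers", "name"], text=True)`.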

Page 22: Troubleshooting common oslo.messaging and RabbitMQ issues

Checking RabbitMQ cluster for integrity

# rabbitmqctl cluster_status

Check that its output contains all the nodes in the cluster. You might find that your cluster is partitioned.

Partitioning is a common reason for messages to get stuck in queues.

Page 24: Troubleshooting common oslo.messaging and RabbitMQ issues

How to fix such issues

For RabbitMQ issues including partitioning, see RabbitMQ docs

Restarting the affected services helps in most cases

Force close connections using `rabbitmqctl` or HTTP API

Page 25: Troubleshooting common oslo.messaging and RabbitMQ issues

Never set amqp_auto_delete = true

Use a queue expiration policy instead, with a TTL of at least 1 minute

Starting with Mitaka, all queues that were auto-delete by default have been replaced with expiring ones
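
One way to apply such an expiration policy is `rabbitmqctl set_policy` with the `expires` key, which takes milliseconds. The sketch below only builds the command line and does not run it. The policy name `expiry`, the `^reply_` pattern, and the 10-minute TTL are illustrative assumptions, not values from the talk.

```python
import json

MIN_TTL_MS = 60000  # the talk recommends a TTL of at least 1 minute


def expiry_policy_command(pattern, ttl_ms, name="expiry", vhost="/"):
    """Build argv for a policy that expires unused queues after ttl_ms."""
    if ttl_ms < MIN_TTL_MS:
        raise ValueError("use a TTL of at least 1 minute")
    definition = json.dumps({"expires": ttl_ms})
    return ["rabbitmqctl", "set_policy", "-p", vhost,
            "--apply-to", "queues", name, pattern, definition]


# Hypothetical pattern matching reply queues only.
cmd = expiry_policy_command(r"^reply_", ttl_ms=600000)

if __name__ == "__main__":
    print(" ".join(cmd))
```

Passing the list to `subprocess.check_call(cmd)` on a broker node would apply the policy. Review the pattern against your own queue names first.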

Page 26: Troubleshooting common oslo.messaging and RabbitMQ issues

Why not amqp_auto_delete?

[Diagram: nova-conductor sends a message through RabbitMQ to nova-compute via the queue compute.node-1.domain.tld, which is declared with auto-delete = true; a network hiccup is marked on the diagram.]

Page 27: Troubleshooting common oslo.messaging and RabbitMQ issues

Queue mirroring is quite expensive

Our testing shows a 2x drop in throughput on a 3-node cluster with the ‘ha-mode: all’ policy compared with non-mirrored queues.

RPC can live without it

But notifications might be too important to lose (if used for billing)

In the latter case, enable mirroring for notification queues only (example in Fuel)

Page 30: Troubleshooting common oslo.messaging and RabbitMQ issues

Use different backends for RPC and Notifications

Different drivers

Same driver. For example:

RPC messages go through one RabbitMQ cluster

Notification messages go through another RabbitMQ cluster

Implementation (undocumented)

* Available starting from Mitaka

Page 32: Troubleshooting common oslo.messaging and RabbitMQ issues

Part 2

Page 36: Troubleshooting common oslo.messaging and RabbitMQ issues

Erlang VM process disappears

Syslog, kern.log, /var/log/messages: grep for “killed process”

“Cannot allocate 1117203264527168 bytes of memory (of type …)” — move to Erlang 17.5 or 18.3

Page 39: Troubleshooting common oslo.messaging and RabbitMQ issues

RAM usage

`rabbitmqctl status`

`rabbitmqctl list_queues name messages memory consumers`

Page 46: Troubleshooting common oslo.messaging and RabbitMQ issues

Stats DB overload

Connections, channels, queues, and nodes emit stats on a timer

With many of those, the stats DB collector can fall behind

`rabbitmqctl status` reports most RAM used by `mgmt_db`

You can reset it: `rabbitmqctl eval 'exit(erlang:whereis(rabbit_mgmt_db), please_terminate).'`

Resetting is safe but may confuse your monitoring tools

A new, better-parallelized event collector is coming in RabbitMQ 3.6.2

Page 51: Troubleshooting common oslo.messaging and RabbitMQ issues

RAM usage

`rabbitmqctl status`

`rabbitmqctl list_queues name messages memory consumers`

rabbitmq_top

`rabbitmqctl list_connections | wc -l`

`rabbitmqctl list_channels | wc -l`

Reduce TCP buffer size: RabbitMQ Networking guide

To enforce a per-connection channel limit, use `rabbit.channel_max`.
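
Before choosing a channel limit, it helps to know how many channels each connection actually holds. The sketch below counts them from the output of `rabbitmqctl list_channels connection`. It assumes each data row is a single connection pid printed as `<...>`, and the pids in the sample are made up.

```python
from collections import Counter


def channels_per_connection(output):
    """Count channels per connection pid in `list_channels connection` output."""
    counts = Counter()
    for line in output.splitlines():
        line = line.strip()
        # Connection identifiers are Erlang pids, printed as <...>.
        if line.startswith("<") and line.endswith(">"):
            counts[line] += 1
    return counts


# Illustrative output, not captured from a real node.
SAMPLE = (
    "Listing channels ...\n"
    "<rabbit@node-1.1.234.0>\n"
    "<rabbit@node-1.1.234.0>\n"
    "<rabbit@node-1.1.567.0>\n"
)

if __name__ == "__main__":
    for pid, n in channels_per_connection(SAMPLE).most_common():
        print(pid, n)
```

A connection with far more channels than its peers is a likely channel leak in the client.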

Page 54: Troubleshooting common oslo.messaging and RabbitMQ issues

Unresponsive nodes

`rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'`

Pivotal & Erlang Solutions contributed a few Mnesia deadlock fixes in Erlang/OTP 18.3.1 and 19.0

Page 60: Troubleshooting common oslo.messaging and RabbitMQ issues

TCP connections are rejected

Ensure traffic on RabbitMQ ports is accepted by firewall

Ensure RabbitMQ listens on correct network interfaces

Check open file handles limit (defaults on Linux are completely inadequate)

TCP connection backlog size: `rabbit.tcp_listen_options.backlog`, `net.core.somaxconn`

Consult RabbitMQ logs for authentication and authorization errors
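
Two of these checks can be done from a client host with the Python standard library. This is a small sketch, with helper names invented for illustration. `open_file_limit` reports the limit of the process that runs it, so it is only meaningful for RabbitMQ when run under the same user and limits as the broker. `can_connect` only proves that the TCP handshake succeeds.

```python
import resource
import socket


def open_file_limit():
    """Return the (soft, hard) open file limit of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("open file limit (soft, hard):", open_file_limit())
    # 5672 is the default AMQP 0-9-1 port; the host name is a placeholder.
    print("broker reachable:", can_connect("localhost", 5672))
```

A `False` result points at the firewall, the listener interface, or the backlog. A `True` result followed by a client failure points at authentication, authorization, or TLS.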

Page 68: Troubleshooting common oslo.messaging and RabbitMQ issues

TLS connections fail

Deserves a talk of its own

See log files

`openssl s_client` (`man 1 s_client`)

`openssl s_server` (`man 1 s_server`)

Ensure peer CA certificate is trusted and verification depth is sufficient

Troubleshooting TLS on rabbitmq.com

Run Erlang 17.5 or 18.3.1

Page 73: Troubleshooting common oslo.messaging and RabbitMQ issues

Message payload inspection

Message tracing: `rabbitmqctl trace_on -p my-vhost`, amq.rabbitmq.trace

rabbitmq_tracing

Tracing puts *very* high load on the system

Wireshark (tcpdump, …)

Page 77: Troubleshooting common oslo.messaging and RabbitMQ issues

Higher than expected latency

Wireshark (tcpdump, …)

strace, DTrace, …

Erlang VM scheduler-to-core binding (pinning)

Page 82: Troubleshooting common oslo.messaging and RabbitMQ issues

General remarks

Guessing is not effective (or efficient)

Use tools to gather more data

Always consult log files

Ask on rabbitmq-users

Page 86: Troubleshooting common oslo.messaging and RabbitMQ issues

Thank you

@michaelklishin

rabbitmq-users

