Date posted: 16-Apr-2017
Identifying (and fixing) oslo.messaging & RabbitMQ issues
Michael Klishin, Pivotal
Dmitry Mescheryakov, Mirantis
What is oslo.messaging?
● Library for
○ building RPC clients/servers
○ emitting/handling notifications
● Supports several backends:
○ RabbitMQ
■ based on Kombu - the oldest and best known (and the one we will focus on)
■ based on Pika - recent addition
○ AMQP 1.0
○ ZeroMQ
○ Kafka
● Developed entirely under the OpenStack umbrella
● Used by many OpenStack services, mostly for internal communication
Spawning a VM in Nova
[Diagram: Client → HTTP → nova-api → RPC → nova-conductor → RPC → nova-scheduler → RPC → nova-compute]
Examples
Internal:
● nova-compute sends a report to nova-conductor every minute
● nova-conductor sends a command to spawn a VM to nova-compute
● neutron-l3-agent requests router list from neutron-server
● …
External:
● Every OpenStack service sends notifications to Ceilometer
Where is RabbitMQ in this picture?
[Diagram: for an RPC call, nova-conductor publishes the request to the queue compute.node-1.domain.tld via RabbitMQ; nova-compute consumes it and publishes the response to the reply queue reply_b6686f7be58b4773a2e0f5475368d19a]
Spotting oslo.messaging logs
2016-04-15 11:16:57.239 16181 DEBUG nova.service [req-d83ae554-7ef5-4299-82ce-3f70b00b6490 - - - - -] Creating RPC server for service scheduler start /usr/lib/python2.7/dist-packages/nova/service.py:218
2016-04-15 11:16:57.258 16181 DEBUG oslo.messaging._drivers.pool [req-d83ae554-7ef5-4299-82ce-3f70b00b6490 - - - - -] Pool creating new connection create /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/pool.py:109
...
File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 420, in _send
result = self._waiter.wait(msg_id, timeout)
File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 318, in wait
message = self.waiters.get(msg_id, timeout=timeout)
File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 223, in get
'to message ID %s' % msg_id)
MessagingTimeout: Timed out waiting for a reply to message ID 9e4a677887134a0cbc134649cd46d1ce
My favorite oslo.messaging exception
oslo.messaging operations
● Cast - fire an RPC request and forget about it
● Notify - the same, but with a different message format
● Call - send an RPC request and wait for a reply
Call throws a MessagingTimeout exception when a reply is not received within a certain amount of time
Making a Call
1. Client -> request -> RabbitMQ
2. RabbitMQ -> request -> Server
3. Server processes the request and produces the response
4. Server -> response -> RabbitMQ
5. RabbitMQ -> response -> Client
If the process gets stuck on any step from 2 to 5, the client gets a MessagingTimeout exception.
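The round trip above can be sketched with a plain in-process queue. This is a simplified model, not oslo.messaging's actual implementation: the client publishes a request tagged with a unique msg_id, then blocks on a per-call reply queue, and a timeout while waiting is what surfaces as MessagingTimeout.

```python
import queue
import threading
import uuid

# Hypothetical in-process "broker" state: msg_id -> per-call reply queue.
reply_queues = {}

def server(request_q):
    # Steps 2-4: receive the request, process it, publish the response.
    msg_id, payload = request_q.get()
    reply_queues[msg_id].put("pong:" + payload)

def call(request_q, payload, timeout=1.0):
    # Step 1: publish the request tagged with a unique msg_id.
    msg_id = uuid.uuid4().hex
    reply_queues[msg_id] = queue.Queue()
    request_q.put((msg_id, payload))
    try:
        # Step 5: block on the reply queue; queue.Empty on timeout is the
        # analogue of oslo.messaging's MessagingTimeout.
        return reply_queues[msg_id].get(timeout=timeout)
    finally:
        del reply_queues[msg_id]

request_q = queue.Queue()
threading.Thread(target=server, args=(request_q,), daemon=True).start()
result = call(request_q, "ping")
print(result)  # -> pong:ping
```

If the one-shot server above never consumed the request (a stuck step 2), the client would sit on step 5 until the timeout fired.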
Debug shows the truth
L3 Agent log
CALL msg_id: ae63b165611f439098f1461f906270de exchange: neutron topic: q-reports-plugin
received reply msg_id: ae63b165611f439098f1461f906270de
Neutron Server
received message msg_id: ae63b165611f439098f1461f906270de reply to: reply_df2405440ffb40969a2f52c769f72e30
REPLY msg_id: ae63b165611f439098f1461f906270de reply queue: reply_df2405440ffb40969a2f52c769f72e30
* Examples from Mitaka
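With debug enabled, the msg_id lets you correlate a single call across services. A minimal sketch, with fabricated sample logs standing in for the agent and server output above (file names and paths are illustrative, not fixed OpenStack locations):

```shell
# Fake log files mimicking the debug lines shown above.
printf 'CALL msg_id: ae63b165611f439098f1461f906270de topic: q-reports-plugin\n' > l3-agent.log
printf 'received reply msg_id: ae63b165611f439098f1461f906270de\n' >> l3-agent.log
printf 'received message msg_id: ae63b165611f439098f1461f906270de\n' > neutron-server.log
printf 'REPLY msg_id: ae63b165611f439098f1461f906270de\n' >> neutron-server.log

# Follow one RPC end to end by grepping for its msg_id in both logs:
grep -H 'msg_id: ae63b165611f439098f1461f906270de' l3-agent.log neutron-server.log
```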
Enabling debug logging
[DEFAULT]
debug=true
default_log_levels=...,oslo.messaging=DEBUG,...
If you don’t have debug enabled
Examine the stack trace
Find which operation failed
Guess the destination service
Try to find correlating log entries around the time the request was made
File "/opt/stack/neutron/neutron/agent/dhcp/agent.py", line 571, in _report_state
self.state_rpc.report_state(ctx, self.agent_state, self.use_call)
File "/opt/stack/neutron/neutron/agent/rpc.py", line 86, in report_state
return method(context, 'report_state', **kwargs)
File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call
retry=self.retry)
Diagnosing issues through RabbitMQ
● # rabbitmqctl list_queues consumers name
0 consumers indicates that nobody is listening to the queue
● # rabbitmqctl list_queues messages consumers name
If a queue has consumers but messages are still accumulating, the corresponding service either cannot process messages in time, is stuck in a deadlock, or the cluster is partitioned
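Both checks can be scripted; here is a sketch run against a captured sample of `rabbitmqctl list_queues messages consumers name` output (the sample rows and the 100-message threshold are made up for illustration):

```shell
# Fake sample of `rabbitmqctl list_queues messages consumers name`
# output: tab-separated messages, consumers, queue name.
printf '0\t1\tcompute.node-1.domain.tld\n' > queues.txt
printf '152\t1\tconductor\n' >> queues.txt
printf '348\t0\treply_b6686f7be58b4773a2e0f5475368d19a\n' >> queues.txt

# Queues nobody is listening to (consumers == 0):
awk -F'\t' '$2 == 0 { print $3 }' queues.txt

# Queues that have consumers but are still accumulating messages
# (100 is an arbitrary illustrative threshold):
awk -F'\t' '$2 > 0 && $1 > 100 { print $3 }' queues.txt
```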
Checking RabbitMQ cluster for integrity
# rabbitmqctl cluster_status
Check that its output contains all the nodes in the cluster. You might find that your cluster is partitioned.
Partitioning is a good reason for some messages to get stuck in queues.
How to fix such issues
For RabbitMQ issues including partitioning, see RabbitMQ docs
Restarting the affected services helps in most cases
Force close connections using `rabbitmqctl` or HTTP API
Never set amqp_auto_delete = true
Use a queue expiration policy instead, with a TTL of at least 1 minute
Starting with Mitaka, auto-delete queues were replaced by default with expiring ones
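For example, an expiration policy along these lines makes queues expire 60 seconds after they stop being used, instead of being auto-deleted on disconnect (the pattern `.*` and the TTL value are illustrative):

```
rabbitmqctl set_policy --apply-to queues expiry ".*" '{"expires": 60000}'
```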
Why not amqp_auto_delete?
[Diagram: with auto-delete = true, a network hiccup disconnects nova-compute, RabbitMQ auto-deletes the queue compute.node-1.domain.tld, and messages sent by nova-conductor in the meantime are lost]
Queue mirroring is quite expensive
Our testing shows a 2x drop in throughput on a 3-node cluster with the ‘ha-mode: all’ policy compared with non-mirrored queues.
RPC can live without it
But notifications might be too important to lose (e.g. if used for billing)
In the latter case, enable mirroring for notification queues only (example in Fuel)
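Mirroring only the notification queues might look like this; the queue name pattern is an assumption about your deployment's naming and should be checked against the actual queues:

```
rabbitmqctl set_policy --apply-to queues ha-notifications "^notifications\." '{"ha-mode": "all"}'
```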
Use different backends for RPC and Notifications
Different drivers
Same driver. For example:
RPC messages go through one RabbitMQ cluster
Notification messages go through another RabbitMQ cluster
Implementation (undocumented)
* Available starting from Mitaka
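A sketch of the same-driver split, pointing RPC and notifications at different clusters via `transport_url` (hosts and credentials below are placeholders; since the slide notes this is undocumented, verify the option names against your oslo.messaging version):

```
[DEFAULT]
transport_url = rabbit://openstack:secret@rpc-cluster.example.com:5672/

[oslo_messaging_notifications]
transport_url = rabbit://openstack:secret@notif-cluster.example.com:5672/
```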
Part 2
Erlang VM process disappears
Syslog, kern.log, /var/log/messages: grep for “killed process”
“Cannot allocate 1117203264527168 bytes of memory (of type …)” — move to Erlang 17.5 or 18.3
RAM usage
`rabbitmqctl status`
`rabbitmqctl list_queues name messages memory consumers`
Stats DB overload
Connections, channels, queues, and nodes emit stats on a timer
With a lot of those the stats DB collector can fall behind
`rabbitmqctl status` reports most RAM used by `mgmt_db`
You can reset it: `rabbitmqctl eval 'exit(erlang:whereis(rabbit_mgmt_db), please_terminate).'`
Resetting is a safe thing to do but may confuse your monitoring tools
A new, better-parallelized event collector is coming in RabbitMQ 3.6.2
RAM usage
`rabbitmqctl status`
`rabbitmqctl list_queues name messages memory consumers`
rabbitmq_top
`rabbitmqctl list_connections | wc -l`
`rabbitmqctl list_channels | wc -l`
Reduce TCP buffer size: RabbitMQ Networking guide
To enforce a per-connection channel limit, use `rabbit.channel_max`.
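In the classic Erlang-style /etc/rabbitmq/rabbitmq.config, that limit might look like this (the value 32 is illustrative):

```
[{rabbit, [{channel_max, 32}]}].
```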
Unresponsive nodes
`rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'`
Pivotal & Erlang Solutions contributed a few Mnesia deadlock fixes in Erlang/OTP 18.3.1 and 19.0
TCP connections are rejected
Ensure traffic on RabbitMQ ports is accepted by the firewall
Ensure RabbitMQ listens on the correct network interfaces
Check the open file handle limit (Linux defaults are completely inadequate)
TCP connection backlog size: rabbitmq.tcp_listen_options.backlog, net.core.somaxconn
Consult RabbitMQ logs for authentication and authorization errors
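A sketch of raising the accept backlog on both the RabbitMQ and kernel side (the value 4096 is illustrative; the effective backlog is capped by the kernel setting):

```
# /etc/rabbitmq/rabbitmq.config (classic Erlang format)
[{rabbit, [{tcp_listen_options, [{backlog, 4096}]}]}].

# Kernel-side cap on the listen backlog
sysctl -w net.core.somaxconn=4096
```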
TLS connections fail
Deserves a talk of its own
See log files
`openssl s_client` (`man 1 s_client`)
`openssl s_server` (`man 1 s_server`)
Ensure peer CA certificate is trusted and verification depth is sufficient
Troubleshooting TLS on rabbitmq.com
Run Erlang 17.5 or 18.3.1
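For example, a quick handshake check against the broker's TLS port (host, port, and file name below are placeholders for your deployment's values):

```
openssl s_client -connect rabbit-host:5671 -CAfile ca_certificate.pem
```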
Message payload inspection
Message tracing: `rabbitmqctl trace_on -p my-vhost`, amq.rabbitmq.trace
rabbitmq_tracing
Tracing puts *very* high load on the system
Wireshark (tcpdump, …)
Higher than expected latency
Wireshark (tcpdump, …)
strace, DTrace, …
Erlang VM scheduler-to-core binding (pinning)
General remarks
Guessing is not effective (or efficient)
Use tools to gather more data
Always consult log files
Ask on rabbitmq-users
Thank you
@michaelklishin