Date posted: 13-Apr-2017
Category: Engineering
Uploaded by: alexey-lesovsky
dataegret.com
01. 31 January events
02. Failure's key points
03. Preventative measures
https://goo.gl/GO5rYJ

01. 31 January events
17:20 - an LVM snapshot of the production db was taken.
19:00 - database load increased due to spam.
23:00 - secondary's replication process started to lag behind.
23:30 - PostgreSQL database directory was wiped.
02. Failure's key points
1. LVM snapshots and staging provisioning.
2. When a replica starts to lag.
3. Do pg_basebackup properly, part 1.
4. max_wal_senders was exceeded, but how?
5. max_connections = 8000.
6. pg_basebackup «stuck»: do pg_basebackup properly, part 2.
7. strace: a good thing in the wrong place.
8. rm or not rm?
9. A bit about backup.
10. Different PG versions in production.
11. Broken mail.
Staging based on LVM snapshots
Snapshot impact on underlying storage.
Provisioning from backup.
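The storage impact mentioned above comes from how LVM snapshots work: every first write to an origin extent must be copied into the snapshot volume. A minimal sketch, with hypothetical volume group and volume names:

```shell
# Copy-on-write snapshot of a (hypothetical) pgdata logical volume.
# While it exists, every first write to an origin extent gets copied,
# adding write latency on the production volume.
lvcreate --snapshot --size 20G --name pgdata_snap /dev/vg0/pgdata

# Provision staging from the snapshot, then drop it promptly:
# overhead grows as it fills, and a full snapshot becomes invalid.
lvremove -f /dev/vg0/pgdata_snap
```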
When a replica started to lag
Re-initialize the standby.
Monitor with pg_stat_replication.
Use wal_keep_segments while troubleshooting.
Use a WAL archive.
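Lag can be watched from the primary. A sketch against the 9.6-era pg_stat_replication view (these columns were renamed to `*_lsn` in PostgreSQL 10):

```sql
-- How far behind each standby is, as seen from the primary (9.6 columns).
SELECT client_addr, state,
       pg_xlog_location_diff(sent_location, replay_location) AS replay_lag_bytes
FROM pg_stat_replication;
```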
Do pg_basebackup properly. Part 1
Run pg_basebackup into a clean (empty) directory.
When removing an «unnecessary» directory, use mv instead of rm.
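pg_basebackup refuses to write into a non-empty directory, so the safe pattern is a freshly created target; host, user, and paths below are illustrative:

```shell
# Take a base backup into a brand-new, empty directory.
mkdir -p /var/lib/postgresql/9.6/standby_new
pg_basebackup -h primary.example.com -U replicator \
    -D /var/lib/postgresql/9.6/standby_new -X stream -P
```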
max_wal_senders was exceeded, but how?
There was only one standby (which had failed).
Increase max_wal_senders.
Check who has stolen the connections.
The limit was exceeded by concurrent pg_basebackups.
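A running pg_basebackup occupies a WAL sender slot just like a real standby, so «stolen» slots can be identified from the same view; a 9.x-era sketch (raising the limit needs a restart there):

```sql
-- Who is holding WAL sender slots right now? A pg_basebackup shows up
-- alongside real standbys (state = 'backup' while it copies data).
SELECT pid, usename, application_name, state
FROM pg_stat_replication;

-- Raise the ceiling; on 9.x this takes effect only after a restart.
ALTER SYSTEM SET max_wal_senders = 10;
```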
max_connections = 8000
More than 500 is a bad idea.
Use pgbouncer to reduce the number of server connections.
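With a pooler in front, the client-facing limit can stay huge while PostgreSQL itself sees only a small, steady pool. A minimal pgbouncer.ini sketch; the database name and paths are illustrative:

```ini
[databases]
appdb = host=127.0.0.1 port=5432

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 8000     ; clients may still open thousands...
default_pool_size = 100    ; ...but the server sees at most this many per db/user
```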
Do pg_basebackup properly. Part 2
Don't run more than one pg_basebackup at a time.
It wasn't stuck, it was waiting for a checkpoint.
Use the «-c fast» option to force an immediate checkpoint.
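By default a base backup waits for the primary's next spread checkpoint, which with a long checkpoint_timeout can look exactly like a hang; the -c/--checkpoint option avoids that (names and paths illustrative):

```shell
# --checkpoint=fast makes the primary checkpoint immediately instead of
# the backup sitting (apparently "stuck") until the spread one finishes.
pg_basebackup -h primary.example.com -U replicator \
    -D /var/lib/postgresql/9.6/standby_new -c fast -X stream -P
```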
strace: a good thing in the wrong place
strace wasn't the right tool in this case.
Use strace for tracing system call errors.
Check the stack trace via /proc/<pid>/stack or GDB.
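To see why a process is blocked without perturbing it the way strace's ptrace attach can, the kernel-side view is often enough; `<pid>` is a placeholder:

```shell
# Kernel stack of the target process (requires root).
cat /proc/<pid>/stack
# The kernel function it is sleeping in, as a single word:
cat /proc/<pid>/wchan; echo
# Userspace backtrace via a non-interactive GDB session:
gdb --batch -p <pid> -ex 'bt'
```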
rm or not rm
The data directory was cleaned with rm.
Use mv instead of rm.
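The mv-instead-of-rm habit can be sketched in a few lines; here it is demonstrated in a temporary directory, with `$PGDATA` standing in for the real data directory:

```shell
# Instead of `rm -rf`, park the directory under a timestamped name;
# undoing the "deletion" is then a single rename, not a restore.
PGDATA=$(mktemp -d)/main          # stand-in for the real data directory
mkdir -p "$PGDATA"

STAMP=$(date +%Y%m%d%H%M%S)
mv "$PGDATA" "$PGDATA.removed.$STAMP"   # "delete"
ls -d "$PGDATA".removed.*               # the data is still here if needed
```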
A bit about backup
Daily pg_dump.
Daily LVM snapshot.
Daily Azure snapshot.
PostgreSQL streaming replication.
Base backup with WAL archive.
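The last item, base backup plus WAL archive, is what enables point-in-time recovery. A 9.6-era postgresql.conf sketch; the archive path is illustrative, and the archive_command follows the pattern shown in the PostgreSQL docs:

```conf
# postgresql.conf fragment: continuous WAL archiving (9.6-era settings)
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'
```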
Different PG versions in production
Clean out old packages after a major upgrade.
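On Debian-style systems (assumed here), leftover binaries from the old major version are easy to spot and remove; the version number is illustrative:

```shell
# List installed PostgreSQL packages, then drop the obsolete major version.
dpkg -l 'postgresql*' | grep '^ii'
sudo apt-get remove postgresql-9.2 postgresql-client-9.2
```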
Broken mail
Set up cron, but forgot notifications.
Use a reliable notification system.
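A cron job whose failure mail never arrives is the same as no monitoring. A crontab sketch; the script name and address are hypothetical, and the delivery path itself must be verified end to end:

```conf
# crontab fragment: cron mails each job's output to MAILTO -- but only
# if local mail delivery actually works, so test it end to end.
MAILTO=ops@example.com
30 2 * * * /usr/local/bin/pg_backup.sh
```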
03. Preventative measures
1. Update PS1 across all hosts to more clearly differentiate between hosts and environments.
2. Prometheus monitoring for backups.
3. Set PostgreSQL's max_connections to a sane value.
4. Investigate point-in-time recovery & continuous archiving for PostgreSQL.
5. Hourly LVM snapshots of the production databases.
6. Azure disk snapshots of production databases.
7. Move staging to the ARM environment.
8. Recover production replica(s).
9. Automated testing of recovering PostgreSQL database backups.
10. Improve PostgreSQL replication documentation/runbooks.
11. Investigate pgbarman for creating PostgreSQL backups.
12. Investigate using WAL-E as a means of database backup and realtime replication.
13. Build Streaming Database Restore.
14. Assign an owner for data durability.
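Item 1 can be as small as a prompt change. A sketch for bash, where the red [PROD] tag is purely illustrative:

```shell
# Make production shells visually unmistakable: bold text on a red
# background, so a wrong-host mistake is hard to make.
PS1='\[\e[1;41m\][PROD]\[\e[0m\] \u@\h:\w\$ '
```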
1. Update PS1 across all hosts.
Looks OK.
2. Prometheus monitoring for backups.
Size, number, age and recovery status.
3. Set PostgreSQL's max_connections to a sane value.
Better to use pgbouncer.
4. Investigate PITR & continuous archiving for PostgreSQL.
Yes, as part of the backup strategy.
5. Hourly LVM snapshots of the production databases.
Looks unnecessary.
6. Azure disk snapshots of production databases.
Looks unnecessary.
7. Move staging to the ARM environment.
Very, very suspicious.
8. Recover production replica(s).
Do that ASAP.
9. Automated testing of recovering database backups.
YES!
10. Improve documentation/runbooks.
You need a bureaucrat.
11. Investigate pgbarman.
Looks OK, Barman is stable and reliable.
12. Investigate using WAL-E.
Looks OK, WAL-E is «set up and forget».
13. Build Streaming Database Restore.
Corresponds with item 9.
14. Assign an owner for data durability.
Hire a DBA.
Lessons learned
Check and monitor backups.
Create emergency instructions.
Learn to use tools properly.
Links
Postmortem of database outage of January 31: https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/
PostgreSQL Statistics Collector, pg_stat_replication view: https://www.postgresql.org/docs/current/static/monitoring-stats.html#PG-STAT-REPLICATION-VIEW
pg_basebackup utility: https://www.postgresql.org/docs/current/static/app-pgbasebackup.html
PostgreSQL Replication: https://www.postgresql.org/docs/9.6/static/runtime-config-replication.html
PgBouncer: https://pgbouncer.github.io/ and https://wiki.postgresql.org/wiki/PgBouncer
Barman: http://www.pgbarman.org/