Date posted: 13-Apr-2017
Category: Engineering
Uploaded by: alexey-lesovsky
dataegret.com
01. 31 January events
02. Failure's key points
03. Preventative measures
https://goo.gl/GO5rYJ

01. 31 January events
17:20 - an LVM snapshot of the production db was taken.
19:00 - database load increased due to spam.
23:00 - secondary's replication process started to lag behind.
23:30 - PostgreSQL database directory was wiped.
02. Failure's key points
1. LVM snapshots and staging provisioning.
2. When a replica starts to lag.
3. Do pg_basebackup properly, part 1.
4. max_wal_senders was exceeded, but how?
5. max_connections = 8000.
6. pg_basebackup «stuck»: do pg_basebackup properly, part 2.
7. strace: a good thing in the wrong place.
8. rm or not rm?
9. A bit about backup.
10. Different PG versions in production.
11. Broken mail.
Staging based on LVM snapshots
Snapshot impact on underlying storage.
Provisioning from backup.
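The storage impact mentioned above comes from how LVM snapshots work: every first write to an origin extent must be copied into the snapshot volume. A minimal sketch, with hypothetical volume group and volume names:

```shell
# Copy-on-write snapshot of a (hypothetical) pgdata logical volume.
# While it exists, every first write to an origin extent gets copied,
# adding write latency on the production volume.
lvcreate --snapshot --size 20G --name pgdata_snap /dev/vg0/pgdata

# Provision staging from the snapshot, then drop it promptly:
# overhead grows as it fills, and a full snapshot becomes invalid.
lvremove -f /dev/vg0/pgdata_snap
```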
When a replica started to lag
Re-initialize the standby.
Monitor with pg_stat_replication.
Use wal_keep_segments while troubleshooting.
Use a WAL archive.
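Lag can be watched from the primary. A sketch against the 9.6-era pg_stat_replication view (these columns were renamed to `*_lsn` in PostgreSQL 10):

```sql
-- How far behind each standby is, as seen from the primary (9.6 columns).
SELECT client_addr, state,
       pg_xlog_location_diff(sent_location, replay_location) AS replay_lag_bytes
FROM pg_stat_replication;
```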
Do pg_basebackup properly. Part 1
Run pg_basebackup into a clean (empty) directory.
When removing an «unnecessary» directory, use mv instead of rm.
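pg_basebackup refuses to write into a non-empty directory, so the safe pattern is a freshly created target; host, user, and paths below are illustrative:

```shell
# Take a base backup into a brand-new, empty directory.
mkdir -p /var/lib/postgresql/9.6/standby_new
pg_basebackup -h primary.example.com -U replicator \
    -D /var/lib/postgresql/9.6/standby_new -X stream -P
```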
max_wal_senders was exceeded, but how?
There was only one standby (which had failed).
Increase max_wal_senders.
Check who has stolen the connections.
The limit was exceeded by concurrent pg_basebackups.
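A running pg_basebackup occupies a WAL sender slot just like a real standby, so «stolen» slots can be identified from the same view; a 9.x-era sketch (raising the limit needs a restart there):

```sql
-- Who is holding WAL sender slots right now? A pg_basebackup shows up
-- alongside real standbys (state = 'backup' while it copies data).
SELECT pid, usename, application_name, state
FROM pg_stat_replication;

-- Raise the ceiling; on 9.x this takes effect only after a restart.
ALTER SYSTEM SET max_wal_senders = 10;
```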
max_connections = 8000
More than 500 is a bad idea.
Use pgbouncer to reduce the number of server connections.
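With a pooler in front, the client-facing limit can stay huge while PostgreSQL itself sees only a small, steady pool. A minimal pgbouncer.ini sketch; the database name and paths are illustrative:

```ini
[databases]
appdb = host=127.0.0.1 port=5432

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 8000     ; clients may still open thousands...
default_pool_size = 100    ; ...but the server sees at most this many per db/user
```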
Do pg_basebackup properly. Part 2
Don't run more than one pg_basebackup at a time.
It wasn't stuck, it was waiting for a checkpoint.
Use the «-c fast» option to force an immediate checkpoint.
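By default a base backup waits for the primary's next spread checkpoint, which with a long checkpoint_timeout can look exactly like a hang; the -c/--checkpoint option avoids that (names and paths illustrative):

```shell
# --checkpoint=fast makes the primary checkpoint immediately instead of
# the backup sitting (apparently "stuck") until the spread one finishes.
pg_basebackup -h primary.example.com -U replicator \
    -D /var/lib/postgresql/9.6/standby_new -c fast -X stream -P
```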
strace: a good thing in the wrong place
strace wasn't the right tool in this case.
Use strace for tracing system call errors.
Check the stack trace via /proc/<pid>/stack or GDB.
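To see why a process is blocked without perturbing it the way strace's ptrace attach can, the kernel-side view is often enough; `<pid>` is a placeholder:

```shell
# Kernel stack of the target process (requires root).
cat /proc/<pid>/stack
# The kernel function it is sleeping in, as a single word:
cat /proc/<pid>/wchan; echo
# Userspace backtrace via a non-interactive GDB session:
gdb --batch -p <pid> -ex 'bt'
```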
rm or not rm
The data directory was cleaned with rm.
Use mv instead of rm.
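The mv-instead-of-rm habit can be sketched in a few lines; here it is demonstrated in a temporary directory, with `$PGDATA` standing in for the real data directory:

```shell
# Instead of `rm -rf`, park the directory under a timestamped name;
# undoing the "deletion" is then a single rename, not a restore.
PGDATA=$(mktemp -d)/main          # stand-in for the real data directory
mkdir -p "$PGDATA"

STAMP=$(date +%Y%m%d%H%M%S)
mv "$PGDATA" "$PGDATA.removed.$STAMP"   # "delete"
ls -d "$PGDATA".removed.*               # the data is still here if needed
```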
A bit about backup
Daily pg_dump.
Daily LVM snapshot.
Daily Azure snapshot.
PostgreSQL streaming replication.
Base backup with WAL archive.
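The last item, base backup plus WAL archive, is what enables point-in-time recovery. A 9.6-era postgresql.conf sketch; the archive path is illustrative, and the archive_command follows the pattern shown in the PostgreSQL docs:

```conf
# postgresql.conf fragment: continuous WAL archiving (9.6-era settings)
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'
```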
Different PG versions in production
Clean out old packages after a major upgrade.
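On Debian-style systems (assumed here), leftover binaries from the old major version are easy to spot and remove; the version number is illustrative:

```shell
# List installed PostgreSQL packages, then drop the obsolete major version.
dpkg -l 'postgresql*' | grep '^ii'
sudo apt-get remove postgresql-9.2 postgresql-client-9.2
```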
Broken mail
Set up cron, but forgot notifications.
Use a reliable notification system.
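A cron job whose failure mail never arrives is the same as no monitoring. A crontab sketch; the script name and address are hypothetical, and the delivery path itself must be verified end to end:

```conf
# crontab fragment: cron mails each job's output to MAILTO -- but only
# if local mail delivery actually works, so test it end to end.
MAILTO=ops@example.com
30 2 * * * /usr/local/bin/pg_backup.sh
```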
03. Preventative measures
1. Update PS1 across all hosts to more clearly differentiate between hosts and environments.
2. Prometheus monitoring for backups.
3. Set PostgreSQL's max_connections to a sane value.
4. Investigate point-in-time recovery & continuous archiving for PostgreSQL.
5. Hourly LVM snapshots of the production databases.
6. Azure disk snapshots of production databases.
7. Move staging to the ARM environment.
8. Recover production replica(s).
9. Automated testing of recovering PostgreSQL database backups.
10. Improve PostgreSQL replication documentation/runbooks.
11. Investigate pgbarman for creating PostgreSQL backups.
12. Investigate using WAL-E as a means of database backup and realtime replication.
13. Build Streaming Database Restore.
14. Assign an owner for data durability.
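Item 1 can be as small as a prompt change. A sketch for bash, where the red [PROD] tag is purely illustrative:

```shell
# Make production shells visually unmistakable: bold text on a red
# background, so a wrong-host mistake is hard to make.
PS1='\[\e[1;41m\][PROD]\[\e[0m\] \u@\h:\w\$ '
```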
1. Update PS1 across all hosts.
Looks OK.
2. Prometheus monitoring for backups.
Size, number, age and recovery status.
3. Set PostgreSQL's max_connections to a sane value.
Better to use pgbouncer.
4. Investigate PITR & continuous archiving for PostgreSQL.
Yes, as part of the backup strategy.
5. Hourly LVM snapshots of the production databases.
Looks unnecessary.
6. Azure disk snapshots of production databases.
Looks unnecessary.
7. Move staging to the ARM environment.
Very, very suspicious.
8. Recover production replica(s).
Do that ASAP.
9. Automated testing of recovering database backups.
YES!
10. Improve documentation/runbooks.
You need a bureaucrat.
11. Investigate pgbarman.
Looks OK, Barman is stable and reliable.
12. Investigate using WAL-E.
Looks OK, WAL-E is «set up and forget».
13. Build Streaming Database Restore.
Corresponds with item 9.
14. Assign an owner for data durability.
Hire a DBA.
Lessons learned
Check and monitor backups.
Create emergency instructions.
Learn to use tools properly.
Links
Postmortem of database outage of January 31: https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/
PostgreSQL Statistics Collector, pg_stat_replication view: https://www.postgresql.org/docs/current/static/monitoring-stats.html#PG-STAT-REPLICATION-VIEW
pg_basebackup utility: https://www.postgresql.org/docs/current/static/app-pgbasebackup.html
PostgreSQL Replication: https://www.postgresql.org/docs/9.6/static/runtime-config-replication.html
PgBouncer: https://pgbouncer.github.io/ and https://wiki.postgresql.org/wiki/PgBouncer
Barman: http://www.pgbarman.org/