www.data61.csiro.au

Adventures in High(ish) Availability
Peter Chubb | Principal Research Engineer
January 21, 2019

CSIRO Data61 Copyright © 2019 CC BY-SA

services

• DNS, DHCP/BOOTP, LDAP, NFS, TFTP, Postgres, kitty, web services, CI services (bamboo™), login, hg, git, machine-queue, bitbucket™ . . .

• around 40 desktops using DHCP, NFS and LDAP

• around 30 dev boards and test machines using BOOTP, TFTP, and NFS

– rebooting every few minutes; different MAC address every reboot

The Situation

• Ancient server hardware (donated to us in 2000 or thereabouts)

• Only some services replicated (DNS, LDAP both master/slave)

• Growing group — downtime costs more

• Desire for planned downtime (kernel upgrades, hardware changes, etc.)

• Applied for capex funding for a new server

– Huge corporate discount

→ Buy Two!

High(ish) availability

99.99999999999999999999999999999999999999999999999999999999%

A few minutes here and there don’t matter

Manual failover for new kernel, replace network card etc. OK

Two Servers!

• 24 cores

• 300 GB RAM

• 16 TB spinning disk with 1.2 TB RAID-1 NVMe cache

• 2 × 10 Gb/s fibre, 8 × 1 Gb/s copper

Replication and/or failover possible.

Two Servers!

[Diagram: the two hosts, Cellar and Brewer, each with containers for DNS, LDAP, TFTP, web, login and NFS. A legend marks each container as running or stopped; successive builds of the slide add lsyncd between the hosts.]

Testing

7:00am Came into work; turned coffee machine on; checked logwatch

7:15am Attempted failover: shut down one host

7:40am Looking good: services all transferred and running

7:45am get coffee

Testing

7:50am Notice login xterms have frozen: can’t log back in. Attempt to get into the hosts’ consoles; can’t do it as me; manage to remember root password. Very slow response.

8:00am get warning (to phone) that webservers are down

8:10am On console, NFS server not responding; can’t connect to nfshomes: no DNS entry.

8:15am (people start arriving at work; can’t work: no local DNS)

8:20am reboot original server; restart original services one at a time; fail back

11am Everything seems normal again; get another coffee

PROBLEMS

• DHCP can’t update names on slave server

• DNS entries time out if master is down.

– Timeouts are short to cope with devboard short lease lifetimes

– Everything stops if DNS stops

• NFS after failover fails

– Handle based on inode number and File-System ID; inode numbers differ between the hosts

– NFSv4 is stateful

• Run out of watch slots for lsyncd

• Postgres failover (sort-of) OK; fail-back difficult
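The lsyncd watch-slot problem can be anticipated before it bites. On Linux, lsyncd relies on inotify, which needs roughly one watch per directory, and the kernel caps watches per user in fs.inotify.max_user_watches. The following is a minimal sketch, not taken from the talk; the function names count_dirs and watches_ok are hypothetical.

```shell
# Count the directories under a tree; an inotify-based watcher such as
# lsyncd needs roughly one watch per directory.
count_dirs() {
    find "$1" -xdev -type d 2>/dev/null | wc -l | tr -d ' '
}

# Succeed if the tree fits within a watch limit. The second argument
# defaults to the kernel's current per-user limit.
watches_ok() {
    limit=${2:-$(cat /proc/sys/fs/inotify/max_user_watches)}
    [ "$(count_dirs "$1")" -le "$limit" ]
}
```

For example, `watches_ok /export/home || echo "raise fs.inotify.max_user_watches"` (the path is illustrative). The limit itself can be raised with sysctl.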

Second attempt

• Stateless services as before

• Per-service solutions for the rest

LDAP

• Not hard to make OpenLDAP replicate master-master.

• Round-robin DNS allows load sharing

• SSSD on clients means short outages don’t matter (much).

Works!

DNS

• LDAP replication working ...

– So use LDAP as backend.

∗ bind9-dyndb-ldap already packaged for Debian

– Works well with BIND 9.11

– Multi-master DNS ‘tricky’, but seems to work.

– Running in containers on both hosts as masters; watchdog ensures containers are running

Works!
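Because multi-master DNS is described as tricky, one way to gain confidence is to ask both masters the same question and compare the answers. This is only a sketch under assumptions: the DIG variable and the function names are hypothetical, and the talk does not say how (or whether) consistency was checked.

```shell
# Command used to query a server; overridable so the check can be
# exercised without a network. (Hypothetical variable name.)
: "${DIG:=dig}"

# Print the sorted A records that server $1 returns for name $2.
answers_from() {
    "$DIG" +short "@$1" "$2" A | sort
}

# Succeed if masters $1 and $2 return the same record set for name $3.
masters_agree() {
    [ "$(answers_from "$1" "$3")" = "$(answers_from "$2" "$3")" ]
}
```

A call might look like `masters_agree dns-a dns-b nfshomes`, where the two server names are placeholders for the DNS containers on each host.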

DHCP

• Still have BOOTP clients, so can’t use native replication

• Server runs in same container as one of the DNS servers, to allow name update

• Watchdog in each DNS container starts dhcpd if it is not running on the DNS replica

• /etc/dhcpd.conf held in git; git pull on start.

Works
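One pass of such a watchdog might look like the sketch below. Everything beyond the bullets above is an assumption: the peer is probed over ssh, dhcpd runs under systemd as isc-dhcp-server, and DHCP_CONF_REPO is a hypothetical variable naming the git checkout that holds the configuration. The talk's actual scripts may differ.

```shell
# Hooks, kept separate so they can be replaced or tested individually.
# Their default bodies are assumptions, not the talk's implementation.
: "${DHCP_CONF_REPO:=/etc/dhcp}"
local_dhcpd_running() { pgrep -x dhcpd >/dev/null 2>&1; }
peer_dhcpd_running()  { ssh "$1" pgrep -x dhcpd >/dev/null 2>&1; }
start_local_dhcpd() {
    # Refresh the configuration from git before starting the server.
    git -C "$DHCP_CONF_REPO" pull --quiet && systemctl start isc-dhcp-server
}

# One watchdog pass: start dhcpd here only if neither side is serving.
dhcp_watchdog() {
    peer=$1
    if local_dhcpd_running || peer_dhcpd_running "$peer"; then
        return 0
    fi
    start_local_dhcpd
}
```

Note that an unreachable peer looks the same as a peer without dhcpd, so a network partition could leave two servers running; the sketch does not try to resolve that.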

NFS

• DRBD for underlying FS

• NFSv4 state on one of the replicated volumes

NFS

1. Check switches are up. Abort if not.

2. Check if DRBD is up-to-date. Abort if not.

3. If remote is up, shut it down:

• stop nfs-kernel-server and rpcbind

• unmount exported volumes

• delete the HA address

• check to see that the HA address is gone; if not, destroy the container.

4. Switch the local DRBD to primary

5. Start the local container if necessary.

6. (in container) mount the filesystems, add the HA address, start nfs-kernel-server
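The ordering of these steps, and where the sequence aborts, can be sketched as a shell function with one hook per step. The hook names and their placeholder bodies are hypothetical; the real scripts would fill them in with commands such as those mentioned in the comments.

```shell
# One hook per step. The default bodies are placeholders only.
switches_up()       { return 0; }  # e.g. ping the switches
drbd_up_to_date()   { return 0; }  # e.g. inspect `drbdadm dstate <res>`
remote_is_up()      { return 1; }
stop_remote()       { :; }         # stop NFS, unmount, drop HA address
drbd_make_primary() { :; }         # e.g. `drbdadm primary <res>`
start_container()   { :; }         # mount, add HA address, start NFS

# Take over the NFS service on this host, aborting early if the
# preconditions in steps 1 and 2 do not hold.
nfs_takeover() {
    switches_up     || { echo "switches down: abort" >&2; return 1; }
    drbd_up_to_date || { echo "DRBD not up to date: abort" >&2; return 1; }
    if remote_is_up; then
        stop_remote || return 1
    fi
    drbd_make_primary && start_container
}
```

Keeping the checks ahead of any state change means an aborted takeover leaves the remote side untouched.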

NFS

Sort-of works.

Planned failovers work

Often see partial failover (DRBD switches rôles for some discs)

Still investigating — packet loss?

Also DAD races for IPv6.

Postgres

• Write-Ahead Log shipping for replication supported

– With ‘just a bit’ of configuration

• Easy to trigger failover

BUT

• Clients don’t know of failover

• No load balancing between active instances

• Fail-back is hard
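Since clients are not told about a failover, something has to work out which instance currently accepts writes. A minimal sketch, assuming psql can reach each host without a password prompt; the function names are hypothetical, while pg_is_in_recovery() is a standard PostgreSQL function that returns true on a standby.

```shell
# Query hook, overridable for testing. Prints "t" if the server is a
# standby still replaying WAL, "f" if it is a primary.
in_recovery() {
    psql -h "$1" -d postgres -Atqc 'SELECT pg_is_in_recovery()' 2>/dev/null
}

# Print the first host in the argument list that accepts writes;
# fail if none does.
find_primary() {
    for host in "$@"; do
        if [ "$(in_recovery "$host")" = "f" ]; then
            echo "$host"
            return 0
        fi
    done
    return 1
}
```

For example, `find_primary cellar brewer` prints whichever host has been promoted. libpq-based clients from PostgreSQL 10 onward can do something similar themselves with a multi-host connection string and target_session_attrs=read-write.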

Postgres

Investigating Patroni as a solution.

Remaining Issues

is_up() {
    ping -c 1 "$1" > /dev/null 2>&1
}

Remaining Issues

Packet loss or congestion causes false down indications.

# Retry with increasing delays before declaring the host down.
is_up() {
    for t in 5 10 30
    do
        ping -c 1 "$1" > /dev/null 2>&1 && return 0
        sleep "$t"
    done
    ping -c 1 "$1" > /dev/null 2>&1
}

Remaining Issues

Where possible, check the service, not the container:

# pg_isready takes the server name via -h, not as a bare argument.
is_up() {
    pg_isready -h "$1" > /dev/null 2>&1
}
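The retry loop and the service-level check can be combined by factoring the retries into a wrapper that runs any check command. This is a sketch rather than the talk's code: with_retries, RETRY_DELAYS, host_up and postgres_up are hypothetical names, and the delays reuse the 5, 10 and 30 second values shown earlier.

```shell
# Delays, in seconds, between attempts (overridable).
: "${RETRY_DELAYS:=5 10 30}"

# Run a check command, retrying after each delay; succeed as soon as
# one attempt succeeds, and make one final attempt after the last delay.
with_retries() {
    for t in $RETRY_DELAYS; do
        "$@" && return 0
        sleep "$t"
    done
    "$@"
}

# Checks built on the wrapper: one at host level, one at service level.
host_up()     { with_retries ping -c 1 "$1" > /dev/null 2>&1; }
postgres_up() { with_retries pg_isready -q -h "$1"; }
```

With the default delays a host is only declared down after four failed attempts spread over about 45 seconds, which trades detection speed for fewer false alarms.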

Orphan Zombies

$ ps axf
...
26313 ? Sl 0:00 /usr/lib/libvirt/libvirt_lxc --name nfshomes ...
26355 ? Ss 0:19 \_ /sbin/init
26455 ? Ss 1:49 \_ /lib/systemd/systemd-journald
26468 ? Ss 0:00 \_ /usr/sbin/blkmapd
...

Orphan Zombies

$ ps axf
...
25234 ? Ss 2:16 [init]
32097 ? Zl 2:11 \_ [apache2] <defunct>
...

• Orphan Zombies

– Kill them all!

∗ every 30 min /usr/local/bin/kill-orphans
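The contents of kill-orphans are not shown in the slides. As an illustration only, the first half of such a job, finding which processes are holding zombies, could look like this; zombie_parents is a hypothetical name.

```shell
# Read lines of "pid ppid stat comm" on stdin and print the distinct
# parent PIDs of processes whose state starts with Z (zombie).
zombie_parents() {
    awk '$3 ~ /^Z/ { print $2 }' | sort -un
}
```

It would be fed with `ps -e -o pid= -o ppid= -o stat= -o comm= | zombie_parents`. A zombie cannot itself be killed; it goes away only when its parent reaps it or exits, which is why the sketch reports parents rather than the zombies themselves.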

But why not use . . .

• corosync and pacemaker

• piranha

• Etc

scripts

Available at: https://bitbucket.csiro.au/projects/TRUSTWORTHYSYSTEMS/repos/hiavail/browse
