Incident #11870
Closed
Two huge problems on the oVirt cluster
100%
Description
Yesterday, a node upgrade followed by a restart crashed the gluster file system. It happened because, even though the restarted node did not have any unsynchronized bricks, some other nodes did. oVirt does not raise an alert in that situation; it only stops the procedure if the node about to be restarted has unsynchronized bricks itself.
The gluster failure caused the shutdown of all the VMs configured on oVirt: the DNS resolver, the authoritative DNS server, the SMTP relay, the VPN gateways.
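The missing check described above (pending heals on any brick, not only on the node being restarted) could be done by hand before a restart. The following is a minimal sketch, assuming the usual `gluster volume heal <volume> info` output with `Brick ...` and `Number of entries: ...` lines; the function names and the sample brick paths are hypothetical and not taken from the oVirt or gluster tooling.

```python
import subprocess


def parse_heal_info(text):
    """Map each brick to its pending heal count (None if the count is unknown)."""
    pending = {}
    brick = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            brick = line[len("Brick "):]
            pending[brick] = None  # unknown until a count line is seen
        elif line.startswith("Number of entries:") and brick is not None:
            value = line.split(":", 1)[1].strip()
            # A disconnected brick reports "-" instead of a number.
            pending[brick] = int(value) if value.isdigit() else None
    return pending


def safe_to_restart(text):
    """True only if every brick of the volume reports zero pending entries."""
    counts = parse_heal_info(text)
    return bool(counts) and all(count == 0 for count in counts.values())


def heal_info(volume):
    """Run the gluster CLI on a cluster node and return its raw output."""
    result = subprocess.run(
        ["gluster", "volume", "heal", volume, "info"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hypothetical sample output: node2 still has one entry to heal.
SAMPLE = """\
Brick node1:/gluster/brick1
Status: Connected
Number of entries: 0

Brick node2:/gluster/brick1
/images/disk1
Status: Connected
Number of entries: 1
"""
```

With the sample above, `safe_to_restart(SAMPLE)` is false because a brick on another node is not yet synchronized, which is the situation that caused this incident.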
Related issues
Updated by Andrea Dell'Amico about 7 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
I managed to fix the gluster file system by explicitly starting the volumes again. After that, the VMs did not start correctly. As I was in a rush, I rebuilt the most important VMs from scratch.
Then, today, I investigated what happened: there is a bug in the cloud-init service that resets the interface configuration to DHCP. As we do not have a password to access the VMs from the console, the only way to fix the problem is to cold mount the VM disks and fix both the network configuration and cloud-init.
Now all the VMs are operational again. I wrote down all the troubleshooting steps needed to restart gluster here: https://support.d4science.org/projects/aginfraplut/wiki/Gluster_management and the ones needed to fix the VMs here: https://support.d4science.org/projects/aginfraplut/wiki/Virtual_Machines_Management
I still have to add a task to the base playbook to fix the cloud-init behaviour on the newly created VMs.
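This ticket does not record the exact change applied to cloud-init. One documented way to stop cloud-init from rewriting the interface configuration is a drop-in file under `/etc/cloud/cloud.cfg.d/` containing `network: {config: disabled}`. The sketch below writes that file into a cold-mounted VM root; the function name and the mount point in the usage note are hypothetical, and whether this matches the fix used here is an assumption.

```python
from pathlib import Path

# Documented cloud-init setting that disables its network configuration step.
DISABLE_NETWORK_CFG = "network: {config: disabled}\n"


def disable_cloud_init_network(root):
    """Write the drop-in under <root>/etc/cloud/cloud.cfg.d and return its path.

    `root` is the directory where the VM disk is mounted (assumption:
    the guest keeps cloud-init's configuration in the standard location).
    """
    cfg_dir = Path(root) / "etc" / "cloud" / "cloud.cfg.d"
    cfg_dir.mkdir(parents=True, exist_ok=True)
    target = cfg_dir / "99-disable-network-config.cfg"
    target.write_text(DISABLE_NETWORK_CFG)
    return target
```

For example, with a guest disk mounted on a hypothetical `/mnt/vmdisk`, calling `disable_cloud_init_network("/mnt/vmdisk")` creates the file; the static network configuration still has to be restored separately.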
Updated by Andrea Dell'Amico about 7 years ago
- Related to Task #11873: Fix the networking bug introduced by cloud-init on Ubuntu 16.04 oVirt guests added