Incident #11870: Two huge problems on the oVirt cluster - D4Science Infrastructure - D4science

Incident #11870

closed

Two huge problems on the oVirt cluster

Added by Andrea Dell'Amico almost 8 years ago. Updated almost 8 years ago.

Status:

Closed

Priority:

Immediate

Assignee:

Andrea Dell'Amico

Category:

System Application

Target version:

Migrate from Xen to a new virtualisation system

Start date:

Jun 01, 2018

Due date:

% Done:

100%

Estimated time:

Infrastructure:

Development, Pre-Production, Production

Description

Yesterday, a node upgrade followed by a restart crashed the Gluster file system. It happened because, although the restarted node had no unsynchronized bricks, some other nodes did. oVirt does not raise an alert in that situation; it only stops the procedure when the node to be restarted has unsynchronized bricks itself.

The Gluster failure caused the shutdown of all the VMs configured on oVirt: the DNS resolver, the authoritative DNS server, the SMTP relay, and the VPN gateways.
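Since oVirt only checks the node being restarted, a cluster-wide check can be run by hand before any restart. The following is a minimal sketch, not the procedure actually used here: it assumes the usual text layout of `gluster volume heal <volume> info` (a `Brick ...` line followed by a `Number of entries:` line per brick). The function names, the sample host names and the brick paths are hypothetical.

```python
import subprocess

# Hypothetical sample of `gluster volume heal <volume> info` output.
SAMPLE = """\
Brick node1:/gluster/data/brick
Status: Connected
Number of entries: 0

Brick node2:/gluster/data/brick
Status: Connected
Number of entries: 3

Brick node3:/gluster/data/brick
Status: Transport endpoint is not connected
Number of entries: -
"""


def bricks_with_pending_heals(heal_info: str) -> dict:
    """Return {brick: pending_entries} for every brick that is not fully synced.

    A brick whose entry count cannot be parsed (for example "-" when the
    brick is unreachable) is reported with -1 and treated as unsafe.
    """
    pending = {}
    brick = None
    for raw in heal_info.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            brick = line[len("Brick "):]
        elif line.startswith("Number of entries:") and brick is not None:
            value = line.split(":", 1)[1].strip()
            count = int(value) if value.isdigit() else -1
            if count != 0:
                pending[brick] = count
    return pending


def safe_to_restart(volume: str) -> bool:
    """Run the heal query for one volume and report whether ALL bricks are in sync."""
    out = subprocess.run(
        ["gluster", "volume", "heal", volume, "info"],
        capture_output=True, text=True, check=True,
    ).stdout
    return not bricks_with_pending_heals(out)


if __name__ == "__main__":
    # Parse the sample only; safe_to_restart() needs a real Gluster node.
    print(bricks_with_pending_heals(SAMPLE))
```

The point of the sketch is that the check covers every brick of the volume, not only those hosted on the node about to be restarted.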

Updated by Andrea Dell'Amico almost 8 years ago

Status changed from New to Closed
% Done changed from 0 to 100

I managed to fix the Gluster file system by explicitly starting the volumes again. After that, the VMs did not start correctly. As I was in a rush, I rebuilt the most important VMs from scratch.
Then, today, I investigated what happened: there is a bug in the cloud-init service that resets the interface configuration to DHCP. As we do not have a password for console access, the only way to fix the problem is to cold mount the VM disks and fix both the network configuration and cloud-init.
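The exact cloud-init fix is not recorded in this ticket. One documented way to stop cloud-init from rewriting the network configuration is a drop-in file containing `network: {config: disabled}`. The sketch below writes that file into a guest root file system that has already been cold mounted. The mount point and the function name are hypothetical, and the static interface configuration still has to be restored separately.

```python
from pathlib import Path

# Documented cloud-init setting that disables its network configuration stage.
DISABLE_NETWORK_CFG = "network: {config: disabled}\n"


def disable_cloud_init_network(mounted_root: str) -> Path:
    """Write the drop-in under <mounted_root>/etc/cloud/cloud.cfg.d.

    `mounted_root` is where the VM disk was mounted on the rescue host,
    e.g. a hypothetical /mnt/vmdisk. Returns the path of the file written.
    """
    cfg_dir = Path(mounted_root) / "etc" / "cloud" / "cloud.cfg.d"
    cfg_dir.mkdir(parents=True, exist_ok=True)
    cfg_file = cfg_dir / "99-disable-network-config.cfg"
    cfg_file.write_text(DISABLE_NETWORK_CFG)
    return cfg_file
```

For example, `disable_cloud_init_network("/mnt/vmdisk")` would be run as root on the host where the disk is mounted, before unmounting it and booting the VM again.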

Now all the VMs are operational again. I wrote down all the troubleshooting steps needed to restart Gluster here: https://support.d4science.org/projects/aginfraplut/wiki/Gluster_management and the ones needed to fix the VMs here: https://support.d4science.org/projects/aginfraplut/wiki/Virtual_Machines_Management

I still have to add a task to the base playbook to fix the cloud-init behaviour on the newly created VMs.
