Task #11241: Reset DataMiner cluster - D4Science Infrastructure - D4science

Task #11241

closed

Reset DataMiner cluster

Added by Gianpaolo Coro over 8 years ago. Updated over 8 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Roberto Cirillo

Category:

High-Throughput-Computing

Target version:

Data Processing

Start date:

Feb 20, 2018

Due date:

% Done:

100%

Estimated time:

Infrastructure:

Production

Description

The production cluster services should be restarted in order to stop all running DataMiner jobs. In addition, the files in the ecocfg folder whose names start with "ar_bigdata" should be deleted. Some DMs do not respond to GetCapabilities requests, yet they are not excluded from the cluster. This usually happens when an algorithm library is not installed correctly.

However, the main reason for this restart is that I found some issues in the Argo NetCDF processing code and reduced the computation time by 70%, so I would like to run the computations again.
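For reference, the cleanup step requested above could be sketched in Python roughly as follows. This is only an illustration, not the playbook that was actually run: the function name `remove_bigdata_files` and the `cfg_dir` argument are hypothetical, and the real location of the ecocfg folder on each DataMiner host has to be supplied by whoever runs it.

```python
from pathlib import Path


def remove_bigdata_files(cfg_dir, dry_run=True):
    """Delete regular files in cfg_dir whose names start with 'ar_bigdata'.

    cfg_dir is the path of the ecocfg folder (an assumption: it must be
    passed in, it is not known from this ticket). With dry_run=True nothing
    is deleted; the matching file names are only reported.
    """
    matched = []
    for path in sorted(Path(cfg_dir).glob("ar_bigdata*")):
        # Skip directories so that only plain files are ever removed.
        if path.is_file():
            if not dry_run:
                path.unlink()
            matched.append(path.name)
    return matched
```

The dry run is the default so that the list of matching files can be checked before anything is removed.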

Updated by Roberto Cirillo over 8 years ago

Status changed from New to In Progress

Updated by Roberto Cirillo over 8 years ago

The fix is running right now.

Updated by Roberto Cirillo over 8 years ago

Status changed from In Progress to Feedback
Assignee changed from Roberto Cirillo to Gianpaolo Coro
% Done changed from 0 to 100

Done. Some GARR hosts were unreachable. This is the Ansible recap:

192.168.100.13             : ok=3    changed=2    unreachable=0    failed=0   
192.168.100.14             : ok=0    changed=0    unreachable=1    failed=0   
192.168.100.19             : ok=3    changed=2    unreachable=0    failed=0   
192.168.100.3              : ok=0    changed=0    unreachable=1    failed=0   
192.168.100.6              : ok=0    changed=0    unreachable=1    failed=0   
192.168.100.7              : ok=0    changed=0    unreachable=1    failed=0   
192.168.100.8              : ok=0    changed=0    unreachable=1    failed=0   
192.168.100.9              : ok=0    changed=0    unreachable=1    failed=0   
dataminer-proto-ghost.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer0-proto.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer1-p-d4s.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer1-proto.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer2-p-d4s.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer2-proto.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer3-genericworkers.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer3-p-d4s.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer3-proto.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer4-p-d4s.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer4-proto.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer5-p-d4s.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer5-proto.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-166-23.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-166-24.ct1.garrservices.it : ok=0    changed=0    unreachable=1    failed=0   
ip-90-147-167-173.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-175.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-176.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-177.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-178.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-179.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-180.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-181.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-182.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-183.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-222.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-230.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-234.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-236.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-237.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0
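A recap like the one above can be scanned for unreachable hosts with a few lines of Python. The sketch below is only an illustration: the function name `unreachable_hosts` is hypothetical, and the regular expression assumes the `host : ok=… changed=… unreachable=… failed=…` layout shown in this comment.

```python
import re

# One recap line, e.g. "192.168.100.14 : ok=0 changed=0 unreachable=1 failed=0"
RECAP_LINE = re.compile(
    r"^(?P<host>\S+)\s*:\s*ok=(?P<ok>\d+)\s+changed=(?P<changed>\d+)\s+"
    r"unreachable=(?P<unreachable>\d+)\s+failed=(?P<failed>\d+)\s*$"
)


def unreachable_hosts(recap_text):
    """Return the hosts whose recap line reports unreachable > 0."""
    hosts = []
    for line in recap_text.splitlines():
        match = RECAP_LINE.match(line.strip())
        # Lines that do not look like recap entries are ignored.
        if match and int(match.group("unreachable")) > 0:
            hosts.append(match.group("host"))
    return hosts
```

Applied to the recap above, this would list the seven hosts reported with `unreachable=1`.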

Updated by Gianpaolo Coro over 8 years ago

I don't understand why HAProxy did not recognize that they were down.

Updated by Roberto Cirillo over 8 years ago

Status changed from Feedback to Closed
Assignee changed from Gianpaolo Coro to Roberto Cirillo

This is not an HAProxy problem. The unreachable hosts above were timing out when I tried to connect to those VMs via ssh. I have run the fix again and now all the hosts have been restarted correctly.
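A quick way to tell this kind of ssh timeout apart from a service-level failure is to test whether the ssh port accepts TCP connections at all. The following Python sketch is an assumption-laden illustration, not part of the fix: the helper name `ssh_port_reachable`, the default port 22 and the 5-second timeout are all hypothetical choices.

```python
import socket


def ssh_port_reachable(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout.

    This checks network reachability of the ssh port only; it does not
    authenticate and says nothing about the services running on the host.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name-resolution errors.
        return False
```

A host for which this returns False would be expected to show up as unreachable in an Ansible run, regardless of what the load balancer reports about its services.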
