Task #11241

Reset DataMiner cluster

Added by Gianpaolo Coro over 7 years ago. Updated over 7 years ago.

Status: Closed
Priority: Normal
Category: High-Throughput-Computing
Target version:
Start date: Feb 20, 2018
Due date:
% Done: 100%
Estimated time:
Infrastructure: Production

Description

The production cluster services should be restarted in order to stop all the DataMiner jobs. In addition, the files in the ecocfg folder matching "ar_bigdata*" should be deleted. Some DMs do not respond to GetCapabilities requests, yet they are not excluded from the cluster. This usually happens when an algorithm library is not correctly installed.

However, the main reason for this restart is that I found some issues in the Argo NetCDF processing code and reduced the computation time by 70%, so I would like to run the computations again.
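
The cleanup requested above (deleting the "ar_bigdata*" files from the ecocfg folder) could be sketched as follows. This is only an illustration, not the procedure actually used on the cluster: the function name `remove_bigdata_files` and its `dry_run` flag are hypothetical, and the location of the ecocfg folder on each node is not stated in this ticket, so it must be passed in by the caller.

```python
from pathlib import Path


def remove_bigdata_files(folder, pattern="ar_bigdata*", dry_run=True):
    """Return the regular files in `folder` whose names match `pattern`.

    With dry_run=True (the default) nothing is deleted, so the list can be
    reviewed first. With dry_run=False the matching files are removed.
    """
    # glob() is not recursive here: only direct children of `folder` match.
    matches = sorted(p for p in Path(folder).glob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

A first call such as `remove_bigdata_files("/path/to/ecocfg")` (placeholder path) only lists the candidates; passing `dry_run=False` afterwards performs the deletion.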

Actions #1

Updated by Roberto Cirillo over 7 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Roberto Cirillo over 7 years ago

The fix is running right now.

Actions #3

Updated by Roberto Cirillo over 7 years ago

  • Status changed from In Progress to Feedback
  • Assignee changed from Roberto Cirillo to Gianpaolo Coro
  • % Done changed from 0 to 100

Done. Some GARR hosts were unreachable: 7 of the 38 hosts in the recap below report unreachable=1. This is the Ansible recap:

192.168.100.13             : ok=3    changed=2    unreachable=0    failed=0   
192.168.100.14             : ok=0    changed=0    unreachable=1    failed=0   
192.168.100.19             : ok=3    changed=2    unreachable=0    failed=0   
192.168.100.3              : ok=0    changed=0    unreachable=1    failed=0   
192.168.100.6              : ok=0    changed=0    unreachable=1    failed=0   
192.168.100.7              : ok=0    changed=0    unreachable=1    failed=0   
192.168.100.8              : ok=0    changed=0    unreachable=1    failed=0   
192.168.100.9              : ok=0    changed=0    unreachable=1    failed=0   
dataminer-proto-ghost.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer0-proto.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer1-p-d4s.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer1-proto.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer2-p-d4s.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer2-proto.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer3-genericworkers.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer3-p-d4s.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer3-proto.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer4-p-d4s.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer4-proto.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer5-p-d4s.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
dataminer5-proto.d4science.org : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-166-23.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-166-24.ct1.garrservices.it : ok=0    changed=0    unreachable=1    failed=0   
ip-90-147-167-173.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-175.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-176.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-177.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-178.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-179.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-180.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-181.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-182.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-183.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-222.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-230.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-234.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-236.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0   
ip-90-147-167-237.ct1.garrservices.it : ok=3    changed=2    unreachable=0    failed=0
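
A recap like the one above can be scanned automatically to list the hosts that need a second run. The sketch below is a minimal example that assumes the recap has been saved as plain text in exactly the line format shown in this comment; the names `parse_recap` and `unreachable_hosts` are hypothetical helpers, not part of Ansible.

```python
import re

# One recap line: "<host> : ok=N changed=N unreachable=N failed=N"
RECAP_LINE = re.compile(
    r"^\s*(?P<host>\S+)\s*:\s*ok=(?P<ok>\d+)\s+changed=(?P<changed>\d+)"
    r"\s+unreachable=(?P<unreachable>\d+)\s+failed=(?P<failed>\d+)"
)


def parse_recap(text):
    """Map each host to its counters; lines that do not match are skipped."""
    results = {}
    for line in text.splitlines():
        match = RECAP_LINE.match(line)
        if match:
            counters = match.groupdict()
            host = counters.pop("host")
            results[host] = {key: int(value) for key, value in counters.items()}
    return results


def unreachable_hosts(text):
    """Return, in recap order, the hosts reporting unreachable > 0."""
    return [host for host, c in parse_recap(text).items() if c["unreachable"] > 0]


# Three lines copied from the recap above.
sample = """\
192.168.100.13             : ok=3    changed=2    unreachable=0    failed=0
192.168.100.14             : ok=0    changed=0    unreachable=1    failed=0
ip-90-147-166-24.ct1.garrservices.it : ok=0    changed=0    unreachable=1    failed=0
"""
print(unreachable_hosts(sample))
# prints ['192.168.100.14', 'ip-90-147-166-24.ct1.garrservices.it']
```
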
Actions #4

Updated by Gianpaolo Coro over 7 years ago

I don't understand why HAProxy did not recognize that they were down.

Actions #5

Updated by Roberto Cirillo over 7 years ago

  • Status changed from Feedback to Closed
  • Assignee changed from Gianpaolo Coro to Roberto Cirillo

This is not an HAProxy problem. The hosts above were timing out when I tried to connect to those VMs via SSH. I have run the fix again, and now all the hosts have been restarted correctly.
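
To tell an SSH timeout apart from a service that is actually down, the SSH port can be probed before re-running the fix. The following is a minimal sketch and not the check that was performed here: `ssh_port_reachable` is a hypothetical helper, and port 22 and the 5-second timeout are assumptions. It only verifies that a TCP connection can be opened, not that a login would succeed.

```python
import socket


def ssh_port_reachable(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name-resolution failures.
        return False
```

For example, `[h for h in hosts if not ssh_port_reachable(h)]` gives the hosts to retry, where `hosts` is whatever inventory list is at hand.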
