Task #11241
closed
Reset DataMiner cluster
100%
Description
The production cluster services should be restarted in order to stop all the DataMiner jobs. In addition, the files in the ecocfg folder matching "ar_bigdata*" should be deleted. Some DMs do not respond with GetCapabilities information for some reason, but they are not excluded from the cluster. Usually, this happens when some algorithm library is not correctly installed.
However, the main reason for this restart is that I found some issues in the Argo NetCDF processing code and reduced the computation time by 70%, so I would like to run the computations again.
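For reference, the file cleanup requested above could be sketched as follows. This is only an illustration, not the procedure actually used on the cluster: the function name `remove_bigdata_files` and the `dry_run` switch are hypothetical, and the real location of the ecocfg folder on each node is not stated in this ticket, so it is passed in as a parameter.

```python
from pathlib import Path


def remove_bigdata_files(ecocfg_dir, dry_run=True):
    """Return the files matching ar_bigdata* in ecocfg_dir.

    The files are deleted only when dry_run is False.
    ecocfg_dir is an assumption: the ticket does not give the real path.
    """
    matches = sorted(
        p for p in Path(ecocfg_dir).glob("ar_bigdata*") if p.is_file()
    )
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling it with `dry_run=True` first lists what would be removed, which is a cheap safety check before deleting anything on a production node.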
Updated by Roberto Cirillo over 7 years ago
- Status changed from New to In Progress
Updated by Roberto Cirillo over 7 years ago
- Status changed from In Progress to Feedback
- Assignee changed from Roberto Cirillo to Gianpaolo Coro
- % Done changed from 0 to 100
Done. Some garr hosts were unreachable. This is the ansible recap:
192.168.100.13 : ok=3 changed=2 unreachable=0 failed=0
192.168.100.14 : ok=0 changed=0 unreachable=1 failed=0
192.168.100.19 : ok=3 changed=2 unreachable=0 failed=0
192.168.100.3 : ok=0 changed=0 unreachable=1 failed=0
192.168.100.6 : ok=0 changed=0 unreachable=1 failed=0
192.168.100.7 : ok=0 changed=0 unreachable=1 failed=0
192.168.100.8 : ok=0 changed=0 unreachable=1 failed=0
192.168.100.9 : ok=0 changed=0 unreachable=1 failed=0
dataminer-proto-ghost.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer0-proto.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer1-p-d4s.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer1-proto.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer2-p-d4s.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer2-proto.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer3-genericworkers.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer3-p-d4s.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer3-proto.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer4-p-d4s.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer4-proto.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer5-p-d4s.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer5-proto.d4science.org : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-166-23.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-166-24.ct1.garrservices.it : ok=0 changed=0 unreachable=1 failed=0
ip-90-147-167-173.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-175.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-176.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-177.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-178.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-179.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-180.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-181.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-182.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-183.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-222.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-230.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-234.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-236.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-237.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
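The unreachable hosts in a recap like the one above can be picked out mechanically. The following is a minimal sketch, assuming the recap is available as plain text in the `host : ok=N changed=N unreachable=N failed=N` form shown here; the names `RECAP_ENTRY` and `unreachable_hosts` are hypothetical helpers, not part of Ansible, and the sample contains only three entries copied from the recap.

```python
import re

# One recap entry: "<host> : ok=N changed=N unreachable=N failed=N".
# Entries may be separated by spaces or newlines.
RECAP_ENTRY = re.compile(
    r"(?P<host>\S+)\s*:\s*ok=(?P<ok>\d+)\s+changed=(?P<changed>\d+)\s+"
    r"unreachable=(?P<unreachable>\d+)\s+failed=(?P<failed>\d+)"
)


def unreachable_hosts(recap_text):
    """Return the hosts whose unreachable counter is greater than zero."""
    return [
        m.group("host")
        for m in RECAP_ENTRY.finditer(recap_text)
        if int(m.group("unreachable")) > 0
    ]


sample = (
    "192.168.100.13 : ok=3 changed=2 unreachable=0 failed=0 "
    "192.168.100.14 : ok=0 changed=0 unreachable=1 failed=0 "
    "ip-90-147-166-24.ct1.garrservices.it : ok=0 changed=0 unreachable=1 failed=0"
)
print(unreachable_hosts(sample))
```

The resulting list could then be used to limit a second run to the hosts that were missed the first time.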
Updated by Gianpaolo Coro over 7 years ago
I don't understand why HAProxy did not recognize that they were down.
Updated by Roberto Cirillo over 7 years ago
- Status changed from Feedback to Closed
- Assignee changed from Gianpaolo Coro to Roberto Cirillo
This is not an HAProxy problem. The hosts above were timing out when I tried to connect to those VMs via ssh. I've run the fix again and now all the hosts have been restarted correctly.