Task #11241
closed
Added by Gianpaolo Coro over 7 years ago.
Updated over 7 years ago.
Category:
High-Throughput-Computing
Infrastructure:
Production
Description
The production cluster services should be restarted in order to stop all the DataMiner jobs. In addition, the files in the ecocfg folder whose names match "ar_bigdata*" should be deleted. Some DMs do not return GetCapabilities information, yet they are not excluded from the cluster. This usually happens when an algorithm library is not correctly installed.
However, the main reason for this restart is that I found some issues in the Argo NetCDF processing code and reduced the computation time by 70%, so I would like to run the computations again.
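The cleanup of the "ar_bigdata*" files requested above could be sketched roughly as follows. This is only an illustration, not the procedure actually used on the cluster: the function name `remove_bigdata_files` and its `cfg_dir` and `dry_run` parameters are assumptions, and the real location of the ecocfg folder on each DataMiner host is not stated in this ticket, so it has to be passed in.

```python
from pathlib import Path


def remove_bigdata_files(cfg_dir, pattern="ar_bigdata*", dry_run=True):
    """Delete regular files directly inside cfg_dir whose names match pattern.

    Returns the sorted list of matching file names. With dry_run=True
    (the default) nothing is deleted, so the matches can be reviewed first.
    cfg_dir is a hypothetical parameter: the ticket does not give the
    actual path of the ecocfg folder.
    """
    matches = sorted(p for p in Path(cfg_dir).glob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return [p.name for p in matches]
```

Calling it once with the default `dry_run=True` lists what would be removed; calling it again with `dry_run=False` performs the deletion. Only files directly inside the folder are considered, not files in subfolders.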
- Status changed from New to In Progress
The fix is running right now.
- Status changed from In Progress to Feedback
- Assignee changed from Roberto Cirillo to Gianpaolo Coro
- % Done changed from 0 to 100
Done. Some GARR hosts were unreachable. This is the Ansible recap:
192.168.100.13 : ok=3 changed=2 unreachable=0 failed=0
192.168.100.14 : ok=0 changed=0 unreachable=1 failed=0
192.168.100.19 : ok=3 changed=2 unreachable=0 failed=0
192.168.100.3 : ok=0 changed=0 unreachable=1 failed=0
192.168.100.6 : ok=0 changed=0 unreachable=1 failed=0
192.168.100.7 : ok=0 changed=0 unreachable=1 failed=0
192.168.100.8 : ok=0 changed=0 unreachable=1 failed=0
192.168.100.9 : ok=0 changed=0 unreachable=1 failed=0
dataminer-proto-ghost.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer0-proto.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer1-p-d4s.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer1-proto.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer2-p-d4s.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer2-proto.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer3-genericworkers.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer3-p-d4s.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer3-proto.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer4-p-d4s.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer4-proto.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer5-p-d4s.d4science.org : ok=3 changed=2 unreachable=0 failed=0
dataminer5-proto.d4science.org : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-166-23.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-166-24.ct1.garrservices.it : ok=0 changed=0 unreachable=1 failed=0
ip-90-147-167-173.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-175.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-176.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-177.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-178.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-179.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-180.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-181.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-182.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-183.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-222.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-230.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-234.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-236.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
ip-90-147-167-237.ct1.garrservices.it : ok=3 changed=2 unreachable=0 failed=0
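To pick the unreachable hosts out of a recap like the one above without reading it line by line, a small parser along these lines might help. It is a sketch that assumes the recap is available as plain text in the `host : key=value ...` layout shown here; the names `parse_recap` and `unreachable_hosts` are illustrative, and the sample contains only three lines copied from the recap.

```python
import re

# One recap line: a host name, a colon, then space-separated key=value counters.
RECAP_LINE = re.compile(r"^\s*(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap(text):
    """Map each host to its counters, e.g. {'ok': 3, 'unreachable': 0}."""
    results = {}
    for line in text.splitlines():
        match = RECAP_LINE.match(line)
        if match:
            pairs = (pair.split("=") for pair in match["counters"].split())
            results[match["host"]] = {key: int(value) for key, value in pairs}
    return results


def unreachable_hosts(recap):
    """Return the sorted hosts whose 'unreachable' counter is non-zero."""
    return sorted(h for h, c in recap.items() if c.get("unreachable", 0) > 0)


# Three lines taken from the recap in this ticket.
SAMPLE = """\
192.168.100.13 : ok=3 changed=2 unreachable=0 failed=0
192.168.100.14 : ok=0 changed=0 unreachable=1 failed=0
ip-90-147-166-24.ct1.garrservices.it : ok=0 changed=0 unreachable=1 failed=0
"""

if __name__ == "__main__":
    print(unreachable_hosts(parse_recap(SAMPLE)))
```

The resulting list could then be used to limit a second run to the hosts that were missed, rather than repeating the restart everywhere.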
I don't understand why HAProxy did not detect that they were down.
- Status changed from Feedback to Closed
- Assignee changed from Gianpaolo Coro to Roberto Cirillo
This is not an HAProxy problem. The hosts listed above were timing out when I tried to connect to those VMs via SSH. I have run the fix again, and now all the hosts have been restarted correctly.
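A quick way to tell an SSH timeout apart from a service that is really down is to probe the SSH port directly before re-running the fix. The sketch below is only an illustration: `ssh_port_open` is a hypothetical helper, port 22 and the 5-second timeout are assumptions, and a successful TCP connection shows only that the port accepts connections, not that an SSH login would succeed.

```python
import socket


def ssh_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Timeouts, refused connections and name-resolution failures are all
    reported as False, since each of them is raised as an OSError.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Hosts for which this returns False are the ones Ansible would be expected to report as unreachable.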