Task #11257
closed
DataMiners not responding
100%
Description
There are 9 non-responding production DataMiners (I report the list with the corresponding getCapabilities):
ip-90-147-167-175.ct1.garrservices.it -> http://ip-90-147-167-175.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-177.ct1.garrservices.it -> http://ip-90-147-167-177.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-179.ct1.garrservices.it -> http://ip-90-147-167-179.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-182.ct1.garrservices.it -> http://ip-90-147-167-182.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-183.ct1.garrservices.it -> http://ip-90-147-167-183.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-166-23.ct1.garrservices.it -> http://ip-90-147-166-23.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-234.ct1.garrservices.it -> http://ip-90-147-167-234.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-237.ct1.garrservices.it -> http://ip-90-147-167-237.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
dm-192-168-100-13.garr.d4science.org -> http://dm-192-168-100-13.garr.d4science.org/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
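For reference, a minimal sketch of how these endpoints can be probed in bulk. This is only an illustration, not the tool actually used here: it assumes Python with the requests library, the token placeholder must be replaced with a valid gcube-token, and the host list is abridged from the one above.

    import requests

    GCUBE_TOKEN = "REPLACE-WITH-A-VALID-GCUBE-TOKEN"  # placeholder, not a real token
    HOSTS = [
        "ip-90-147-167-175.ct1.garrservices.it",
        "ip-90-147-167-177.ct1.garrservices.it",
        "dm-192-168-100-13.garr.d4science.org",
        # ...remaining hosts from the list above
    ]

    for host in HOSTS:
        url = (f"http://{host}/wps/WebProcessingService"
               f"?Request=GetCapabilities&Service=WPS&gcube-token={GCUBE_TOKEN}")
        try:
            resp = requests.get(url, timeout=15)
            # A healthy DataMiner answers HTTP 200 with a WPS Capabilities XML document.
            if resp.status_code == 200 and b"Capabilities" in resp.content:
                print(f"{host}: OK")
            else:
                print(f"{host}: NOT RESPONDING (HTTP {resp.status_code})")
        except requests.RequestException as exc:
            print(f"{host}: NOT RESPONDING ({exc})")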
Could you please help me understand what's happening?
Updated by Roberto Cirillo about 7 years ago
I've checked the following host: dm-192-168-100-13.garr.d4science.org
The container did not restart correctly during the last restart I performed this morning.
The ghn.log shows the following:
2018-02-21 10:05:56,627 [localhost-startStop-1] WARN ContainerManager: the token 9f6d573d-c881-4dfc-ab12-4bec4e07fa7a-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container
2018-02-21 10:05:56,660 [localhost-startStop-1] WARN ContainerManager: the token 7de2d922-89c6-4145-8b90-52011e048379-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container
2018-02-21 10:05:56,688 [localhost-startStop-1] WARN ContainerManager: the token b5754690-675f-4d1e-93de-dbd2f802140a-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container
2018-02-21 10:05:56,715 [localhost-startStop-1] WARN ContainerManager: the token 58fbc1d8-95c5-4308-a955-5a85893b28d9-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container
2018-02-21 10:05:56,742 [localhost-startStop-1] WARN ContainerManager: the token 255b8c63-6c26-4584-99d0-5476586780db-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container
2018-02-21 10:05:56,769 [localhost-startStop-1] WARN ContainerManager: the token 7edf64cd-52b1-41f3-ae56-d91cb13ed51d-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container
2018-02-21 10:05:56,800 [localhost-startStop-1] WARN ContainerManager: the token 8767fcb5-464d-4205-8d6e-752e67f83d1d-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container
2018-02-21 10:05:56,827 [localhost-startStop-1] WARN ContainerManager: the token 8064225f-ef5a-4e41-9017-cf92722e3b2b-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container
2018-02-21 10:05:56,854 [localhost-startStop-1] WARN ContainerManager: the token d29fdd99-0e88-4f13-ae33-f52258e9f578-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container
2018-02-21 10:05:56,854 [localhost-startStop-1] ERROR ContainerManager: no valid starting token are specified, moving the container to failed
2018-02-21 10:05:56,856 [localhost-startStop-1] ERROR ContainerManager: cannot manage container (see cause)
java.lang.RuntimeException: no valid starting token are specified
    at org.gcube.smartgears.managers.ContainerManager.validateContainer(ContainerManager.java:129)
    at org.gcube.smartgears.managers.ContainerManager.start(ContainerManager.java:71)
    at org.gcube.smartgears.Bootstrap.startContainerIfItHasntAlreadyFailed(Bootstrap.java:124)
    at org.gcube.smartgears.Bootstrap.<init>(Bootstrap.java:45)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at java.lang.Class.newInstance(Class.java:442)
    at org.apache.catalina.startup.WebappServiceLoader.loadServices(WebappServiceLoader.java:188)
    at org.apache.catalina.startup.WebappServiceLoader.load(WebappServiceLoader.java:152)
    at org.apache.catalina.startup.ContextConfig.processServletContainerInitializers(ContextConfig.java:1543)
    at org.apache.catalina.startup.ContextConfig.webConfig(ContextConfig.java:1265)
    at org.apache.catalina.startup.ContextConfig.configureStart(ContextConfig.java:873)
    at org.apache.catalina.startup.ContextConfig.lifecycleEvent(ContextConfig.java:371)
    at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
    at org.apache.catalina.util.LifecycleBase.fireLifecycleEvent(LifecycleBase.java:90)
    at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5392)
    at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
    at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901)
    at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877)
    at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:632)
    at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:1073)
    at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1857)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2018-02-21 10:06:46,485 [localhost-startStop-1] ERROR ApplicationManager: error starting application data-transfer-service
java.lang.RuntimeException: cannot create profile for data-transfer-service
    at org.gcube.smartgears.handlers.application.lifecycle.ProfileManager.create(ProfileManager.java:238)
    at org.gcube.smartgears.handlers.application.lifecycle.ProfileManager.loadOrCreateProfile(ProfileManager.java:219)
    at org.gcube.smartgears.handlers.application.lifecycle.ProfileManager.activated(ProfileManager.java:89)
    at org.gcube.smartgears.handlers.application.lifecycle.ProfileManager.onStart(ProfileManager.java:72)
    at org.gcube.smartgears.handlers.application.ApplicationLifecycleHandler.onEvent(ApplicationLifecycleHandler.java:42)
    at org.gcube.smartgears.handlers.application.ApplicationLifecycleHandler.onEvent(ApplicationLifecycleHandler.java:18)
    at org.gcube.smartgears.handlers.Pipeline.forward(Pipeline.java:65)
    at org.gcube.smartgears.managers.ApplicationManager.start(ApplicationManager.java:273)
    at org.gcube.smartgears.managers.ApplicationManager.start(ApplicationManager.java:120)
    at org.gcube.smartgears.Bootstrap.onStartup(Bootstrap.java:61)
    at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5493)
    at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
    at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901)
    at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877)
    at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:632)
    at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:1073)
    at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1857)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: unknown property ghn-profile
    at org.gcube.smartgears.context.Properties.lookup(Properties.java:102)
It seems that the tokens contained in container.xml were generated for the host dataminer-2, which is different from the hostname declared in the container.xml file:
<hostname>dm-192-168-100-13.garr.d4science.org</hostname>
This morning I only restarted the container; nothing changed on container.xml:
gcube@dataminer-2:~$ ls -als SmartGears/container.xml
4 -r--r----- 1 gcube gcube 1656 Feb 12 14:23 Smart
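As a side note, a minimal sketch (a hypothetical helper, not part of SmartGears) that surfaces this kind of mismatch by comparing /etc/hostname with the <hostname> element in SmartGears/container.xml; the file locations are assumed:

    import xml.etree.ElementTree as ET
    from pathlib import Path

    # Assumed locations: SmartGears lives in the gcube user's home directory.
    container_xml = Path.home() / "SmartGears" / "container.xml"
    etc_hostname = Path("/etc/hostname").read_text().strip()

    # <hostname> is the element quoted above; findtext returns "" if it is missing.
    declared = ET.parse(container_xml).getroot().findtext(".//hostname", default="").strip()

    if declared != etc_hostname:
        print(f"MISMATCH: container.xml says '{declared}', /etc/hostname says '{etc_hostname}'")
    else:
        print(f"OK: both report '{declared}'")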
Updated by Andrea Dell'Amico about 7 years ago
I've seen this behaviour in the past, on other VMs. I had to request the scopes again.
What's also worrying is that Tomcat responds in a way that the balancer accepts as 'working fine'.
Updated by Gianpaolo Coro about 7 years ago
This reminds me of an old GHN issue where the local state of the service randomly got corrupted... could it be related, @lucio.lelii@isti.cnr.it?
Updated by Roberto Cirillo about 7 years ago
Andrea Dell'Amico wrote:
I've seen this behaviour in the past, on other VMs. I had to request the scopes again.
I could try to re-run the playbook in order to restore the right tokens, but in /etc/hostname I see the following hostname: dataminer-2
Maybe we need to change the /etc/hostname file before re-running the playbook. @andrea.dellamico@isti.cnr.it, what do you suggest?
Updated by Lucio Lelii about 7 years ago
@gianpaolo.coro@isti.cnr.it no, the problem is related to token generation: the hostname in the container token is dataminer-2, without the domain.
Updated by Roberto Cirillo about 7 years ago
For the moment, I'm going to stop the containers where this problem occurs. This way, new computation requests will not be routed to these VMs.
Updated by Roberto Cirillo about 7 years ago
All the affected VMs have been stopped.
Updated by Gianpaolo Coro about 7 years ago
The non-working DMs are now 10 (one more), all GARR machines:
ip-90-147-167-175.ct1.garrservices.it -> http://ip-90-147-167-175.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-177.ct1.garrservices.it -> http://ip-90-147-167-177.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-179.ct1.garrservices.it -> http://ip-90-147-167-179.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-182.ct1.garrservices.it -> http://ip-90-147-167-182.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-183.ct1.garrservices.it -> http://ip-90-147-167-183.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-166-24.ct1.garrservices.it -> http://ip-90-147-166-24.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-166-23.ct1.garrservices.it -> http://ip-90-147-166-23.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-234.ct1.garrservices.it -> http://ip-90-147-167-234.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-237.ct1.garrservices.it -> http://ip-90-147-167-237.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
dm-192-168-100-13.garr.d4science.org -> http://dm-192-168-100-13.garr.d4science.org/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
Updated by Roberto Cirillo about 7 years ago
In the last list there is just one more VM than the previous list: ip-90-147-166-24.ct1.garrservices.it
On this VM the hostname is correct and the GetCapabilities request works well now: http://ip-90-147-166-24.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
So this VM is working fine for me.
BTW, from the Ansible inventory we have 25 GARR VMs, so at this moment 16 VMs are running properly.
Updated by Pasquale Pagano about 7 years ago
- Priority changed from Normal to High
Updated by Gianpaolo Coro about 7 years ago
Just to alert you that the shut-down GHNs behind the "dataminer_cloud1" production cluster are still receiving requests; indeed, they appear as up on the HAProxy interface. The machines are:
ip-90-147-166-23.ct1.garrservices.it, ip-90-147-167-237.ct1.garrservices.it, dm-192-168-100-13.garr.d4science.org, ip-90-147-167-234.ct1.garrservices.it
Updated by Roberto Cirillo about 7 years ago
Gianpaolo Coro wrote:
Just to alert you that the shut-down GHNs behind the "dataminer_cloud1" production cluster are still receiving requests; indeed, they appear as up on the HAProxy interface. The machines are:
ip-90-147-166-23.ct1.garrservices.it, ip-90-147-167-237.ct1.garrservices.it, dm-192-168-100-13.garr.d4science.org, ip-90-147-167-234.ct1.garrservices.it
"dataminer_cloud1" cluster doesn't exist. Now we have only dataminer.garr cluster.
The containers above are stopped. Where have you seen these requests? Could you proof them?
Updated by Andrea Dell'Amico about 7 years ago
- Status changed from New to In Progress
I understood what happened.
When a GARR VM restarts, the hostname is changed by the DHCP server. When we executed the upgrade jobs, the scopes were requested with the wrong hostname because we did not run the tasks that set the hostname to the one we want.
I just fixed the script that requests the scopes so that it always requests the tokens for the correct hostname. I'm also going to run the tasks that fix the hostname (the playbook tag is set_hostname, FYI).
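For illustration only, a minimal sketch of the kind of guard described above (the actual fix lives in the provisioning scripts and playbook, which are not shown here): refuse to request new tokens when the machine's hostname does not match the expected FQDN.

    import socket
    import sys

    def ensure_expected_hostname(expected_fqdn: str) -> None:
        # Refuse to request new scopes/tokens if DHCP has reset the hostname.
        actual = socket.getfqdn()
        if actual != expected_fqdn:
            sys.exit(f"hostname is '{actual}', expected '{expected_fqdn}': "
                     "run the set_hostname tasks before requesting tokens")

    # The expected FQDN would come from the Ansible inventory; it is hard-coded here
    # only as an example.
    ensure_expected_hostname("dm-192-168-100-13.garr.d4science.org")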
Updated by Gianpaolo Coro about 7 years ago
Great! As for the cluster with the "ghost" DMs, I meant the "dataminer_cloud1" group specified here http://dataminer-lb.garr.d4science.org:8880/
Updated by Andrea Dell'Amico about 7 years ago
- Status changed from In Progress to Feedback
- % Done changed from 0 to 100
I just requested new scopes for all the GARR dataminers; Tomcat has been restarted on all hosts (on some of them it is still starting):
192.168.100.13                        : ok=18  changed=6  unreachable=0  failed=0
192.168.100.14                        : ok=18  changed=5  unreachable=0  failed=0
192.168.100.19                        : ok=18  changed=5  unreachable=0  failed=0
192.168.100.3                         : ok=18  changed=5  unreachable=0  failed=0
192.168.100.6                         : ok=18  changed=5  unreachable=0  failed=0
192.168.100.7                         : ok=18  changed=5  unreachable=0  failed=0
192.168.100.8                         : ok=18  changed=5  unreachable=0  failed=0
192.168.100.9                         : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-166-23.ct1.garrservices.it  : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-166-24.ct1.garrservices.it  : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-167-173.ct1.garrservices.it : ok=18  changed=6  unreachable=0  failed=0
ip-90-147-167-175.ct1.garrservices.it : ok=18  changed=6  unreachable=0  failed=0
ip-90-147-167-176.ct1.garrservices.it : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-167-177.ct1.garrservices.it : ok=18  changed=6  unreachable=0  failed=0
ip-90-147-167-178.ct1.garrservices.it : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-167-179.ct1.garrservices.it : ok=18  changed=6  unreachable=0  failed=0
ip-90-147-167-180.ct1.garrservices.it : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-167-181.ct1.garrservices.it : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-167-182.ct1.garrservices.it : ok=18  changed=6  unreachable=0  failed=0
ip-90-147-167-183.ct1.garrservices.it : ok=18  changed=6  unreachable=0  failed=0
ip-90-147-167-222.ct1.garrservices.it : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-167-230.ct1.garrservices.it : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-167-234.ct1.garrservices.it : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-167-236.ct1.garrservices.it : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-167-237.ct1.garrservices.it : ok=18  changed=6  unreachable=0  failed=0
Gianpaolo Coro wrote:
Great! As for the cluster with the "ghost" DMs, I meant the "dataminer_cloud1" group specified here http://dataminer-lb.garr.d4science.org:8880/
@roberto.cirillo@isti.cnr.it, do you remember that I reinstated that cluster when the GARR instances were stopped for maintenance a couple of weeks ago, so that if another emergency happens you can switch the service endpoint and we can still have the service working with the CNR dataminers only?
Updated by Gianpaolo Coro about 7 years ago
- Status changed from Feedback to Closed
The DMs are working.