Task #11257

DataMiners not responding

Added by Gianpaolo Coro about 7 years ago. Updated about 7 years ago.

Status: Closed
Priority: High
Assignee: _InfraScience Systems Engineer
Category: -
Target version: -
Start date: Feb 21, 2018
Due date: -
% Done: 100%
Estimated time: -
Infrastructure: Production

Description

There are 9 non-responding production DataMiners (I report the list with the corresponding getCapabilities):

ip-90-147-167-175.ct1.garrservices.it -> http://ip-90-147-167-175.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-177.ct1.garrservices.it -> http://ip-90-147-167-177.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-179.ct1.garrservices.it -> http://ip-90-147-167-179.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-182.ct1.garrservices.it -> http://ip-90-147-167-182.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-183.ct1.garrservices.it -> http://ip-90-147-167-183.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-166-23.ct1.garrservices.it -> http://ip-90-147-166-23.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-234.ct1.garrservices.it -> http://ip-90-147-167-234.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-237.ct1.garrservices.it -> http://ip-90-147-167-237.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
dm-192-168-100-13.garr.d4science.org -> http://dm-192-168-100-13.garr.d4science.org/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462

Could you please help me understand what's happening?
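For reference, the responsiveness check over the listed hosts can be scripted (a minimal sketch using only the standard library; it assumes plain HTTP access and a valid gcube-token, and the helper names are illustrative):

```python
from urllib.request import urlopen
from urllib.error import URLError

def capabilities_url(host, token):
    """Build the WPS GetCapabilities URL for a DataMiner host."""
    return ("http://%s/wps/WebProcessingService"
            "?Request=GetCapabilities&Service=WPS&gcube-token=%s" % (host, token))

def check_hosts(hosts, token, timeout=10):
    """Return the subset of hosts whose GetCapabilities call fails."""
    failing = []
    for host in hosts:
        try:
            with urlopen(capabilities_url(host, token), timeout=timeout) as resp:
                body = resp.read()
                # A healthy DataMiner returns a WPS Capabilities document,
                # not just any HTTP 200 response.
                if b"Capabilities" not in body:
                    failing.append(host)
        except (URLError, OSError):
            failing.append(host)
    return failing
```

Running `check_hosts` against the list above would reproduce the set of nine non-responding machines.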

Actions #1

Updated by Roberto Cirillo about 7 years ago

I've checked the following host: dm-192-168-100-13.garr.d4science.org

The container did not restart correctly during the restart I performed this morning.

The ghn.log shows the following:

2018-02-21 10:05:56,627 [localhost-startStop-1] WARN  ContainerManager: the token 9f6d573d-c881-4dfc-ab12-4bec4e07fa7a-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container  
2018-02-21 10:05:56,660 [localhost-startStop-1] WARN  ContainerManager: the token 7de2d922-89c6-4145-8b90-52011e048379-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container  
2018-02-21 10:05:56,688 [localhost-startStop-1] WARN  ContainerManager: the token b5754690-675f-4d1e-93de-dbd2f802140a-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container  
2018-02-21 10:05:56,715 [localhost-startStop-1] WARN  ContainerManager: the token 58fbc1d8-95c5-4308-a955-5a85893b28d9-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container  
2018-02-21 10:05:56,742 [localhost-startStop-1] WARN  ContainerManager: the token 255b8c63-6c26-4584-99d0-5476586780db-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container  
2018-02-21 10:05:56,769 [localhost-startStop-1] WARN  ContainerManager: the token 7edf64cd-52b1-41f3-ae56-d91cb13ed51d-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container  
2018-02-21 10:05:56,800 [localhost-startStop-1] WARN  ContainerManager: the token 8767fcb5-464d-4205-8d6e-752e67f83d1d-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container  
2018-02-21 10:05:56,827 [localhost-startStop-1] WARN  ContainerManager: the token 8064225f-ef5a-4e41-9017-cf92722e3b2b-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container  
2018-02-21 10:05:56,854 [localhost-startStop-1] WARN  ContainerManager: the token d29fdd99-0e88-4f13-ae33-f52258e9f578-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container  
2018-02-21 10:05:56,854 [localhost-startStop-1] ERROR ContainerManager: no valid starting token are specified, moving the container to failed
2018-02-21 10:05:56,856 [localhost-startStop-1] ERROR ContainerManager: cannot manage container (see cause)
java.lang.RuntimeException: no valid starting token are specified
        at org.gcube.smartgears.managers.ContainerManager.validateContainer(ContainerManager.java:129)
        at org.gcube.smartgears.managers.ContainerManager.start(ContainerManager.java:71)
        at org.gcube.smartgears.Bootstrap.startContainerIfItHasntAlreadyFailed(Bootstrap.java:124)
        at org.gcube.smartgears.Bootstrap.<init>(Bootstrap.java:45)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at java.lang.Class.newInstance(Class.java:442)
        at org.apache.catalina.startup.WebappServiceLoader.loadServices(WebappServiceLoader.java:188)
        at org.apache.catalina.startup.WebappServiceLoader.load(WebappServiceLoader.java:152)
        at org.apache.catalina.startup.ContextConfig.processServletContainerInitializers(ContextConfig.java:1543)
        at org.apache.catalina.startup.ContextConfig.webConfig(ContextConfig.java:1265)
        at org.apache.catalina.startup.ContextConfig.configureStart(ContextConfig.java:873)
        at org.apache.catalina.startup.ContextConfig.lifecycleEvent(ContextConfig.java:371)
        at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
        at org.apache.catalina.util.LifecycleBase.fireLifecycleEvent(LifecycleBase.java:90)
        at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5392)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:632)
        at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:1073)
        at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1857)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2018-02-21 10:06:46,485 [localhost-startStop-1] ERROR ApplicationManager: error starting application data-transfer-service 
java.lang.RuntimeException: cannot create profile for data-transfer-service
        at org.gcube.smartgears.handlers.application.lifecycle.ProfileManager.create(ProfileManager.java:238)
        at org.gcube.smartgears.handlers.application.lifecycle.ProfileManager.loadOrCreateProfile(ProfileManager.java:219)
        at org.gcube.smartgears.handlers.application.lifecycle.ProfileManager.activated(ProfileManager.java:89)
        at org.gcube.smartgears.handlers.application.lifecycle.ProfileManager.onStart(ProfileManager.java:72)
        at org.gcube.smartgears.handlers.application.ApplicationLifecycleHandler.onEvent(ApplicationLifecycleHandler.java:42)
        at org.gcube.smartgears.handlers.application.ApplicationLifecycleHandler.onEvent(ApplicationLifecycleHandler.java:18)
        at org.gcube.smartgears.handlers.Pipeline.forward(Pipeline.java:65)
        at org.gcube.smartgears.managers.ApplicationManager.start(ApplicationManager.java:273)
        at org.gcube.smartgears.managers.ApplicationManager.start(ApplicationManager.java:120)
        at org.gcube.smartgears.Bootstrap.onStartup(Bootstrap.java:61)
        at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5493)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:632)
        at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:1073)
        at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1857)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: unknown property ghn-profile
        at org.gcube.smartgears.context.Properties.lookup(Properties.java:102)

It seems that the tokens contained in container.xml were generated for the host dataminer-2, which is different from the hostname found in the container.xml file:

<hostname>dm-192-168-100-13.garr.d4science.org</hostname>

This morning I only restarted the container; nothing changed in container.xml:

gcube@dataminer-2:~$ ls -als  SmartGears/container.xml
4 -r--r----- 1 gcube gcube 1656 Feb 12 14:23 Smart

Actions #2

Updated by Andrea Dell'Amico about 7 years ago

I've seen this behaviour in the past on other VMs. I had to request the scopes again.

What's also worrying is that Tomcat responds in a way that the balancer accepts as 'working fine'.
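This points at the balancer's health check: if it only tests that Tomcat answers, a container that failed SmartGears bootstrap still looks alive. Probing the WPS path itself would catch it. A hypothetical HAProxy sketch (not the actual production configuration; backend and server names are made up, and the gcube-token parameter would also be needed for a real check):

```
backend dataminers
    # Probe the WPS servlet instead of the Tomcat root, so a container
    # stuck in the "failed" state is marked down by the balancer.
    option httpchk GET /wps/WebProcessingService?Request=GetCapabilities&Service=WPS
    server dm1 ip-90-147-167-175.ct1.garrservices.it:80 check
```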

Actions #3

Updated by Gianpaolo Coro about 7 years ago

This reminds me of an old GHN issue where the local state of the service randomly became corrupted. Could it be related, @lucio.lelii@isti.cnr.it?

Actions #4

Updated by Roberto Cirillo about 7 years ago

Andrea Dell'Amico wrote:

I've seen this behaviour in the past, on other VMs. I had to request the scopes again.

I could try to re-run the playbook to restore the right tokens, but in /etc/hostname I see the following hostname: dataminer-2
Maybe we need to change the /etc/hostname file before re-running the playbook. @andrea.dellamico@isti.cnr.it, what do you suggest?

Actions #5

Updated by Lucio Lelii about 7 years ago

@gianpaolo.coro@isti.cnr.it no, the problem is related to the token generation: the hostname in the container token is dataminer-2, without the domain.

Actions #6

Updated by Roberto Cirillo about 7 years ago

For the moment, I'm going to stop the containers where this problem occurs. This way, new computation requests will not be routed to these VMs.

Actions #7

Updated by Roberto Cirillo about 7 years ago

All the affected VMs have been stopped.

Actions #8

Updated by Gianpaolo Coro about 7 years ago

The non-working DMs are now 10 (one more than before), all GARR machines:

ip-90-147-167-175.ct1.garrservices.it -> http://ip-90-147-167-175.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-177.ct1.garrservices.it -> http://ip-90-147-167-177.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-179.ct1.garrservices.it -> http://ip-90-147-167-179.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-182.ct1.garrservices.it -> http://ip-90-147-167-182.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-183.ct1.garrservices.it -> http://ip-90-147-167-183.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-166-24.ct1.garrservices.it -> http://ip-90-147-166-24.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-166-23.ct1.garrservices.it -> http://ip-90-147-166-23.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-234.ct1.garrservices.it -> http://ip-90-147-167-234.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-237.ct1.garrservices.it -> http://ip-90-147-167-237.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
dm-192-168-100-13.garr.d4science.org -> http://dm-192-168-100-13.garr.d4science.org/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462

Actions #9

Updated by Roberto Cirillo about 7 years ago

In the last list there is just one more VM than in the previous one: ip-90-147-166-24.ct1.garrservices.it

On this VM the hostname is correct and the GetCapabilities request works well now: http://ip-90-147-166-24.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462

So this VM is working fine for me.

BTW, according to the Ansible inventory we have 25 GARR VMs, so at this moment 16 VMs are running properly.

Actions #10

Updated by Pasquale Pagano about 7 years ago

  • Priority changed from Normal to High
Actions #11

Updated by Gianpaolo Coro about 7 years ago

Just to alert you that the shut-down GHNs behind the "dataminer_cloud1" production cluster are still receiving requests; indeed, they appear to be up on the HAProxy interface. The machines are:

ip-90-147-166-23.ct1.garrservices.it, ip-90-147-167-237.ct1.garrservices.it, dm-192-168-100-13.garr.d4science.org, ip-90-147-167-234.ct1.garrservices.it

Actions #12

Updated by Roberto Cirillo about 7 years ago

Gianpaolo Coro wrote:

Just to alert you that the shut-down GHNs behind the "dataminer_cloud1" production cluster are still receiving requests; indeed, they appear to be up on the HAProxy interface. The machines are:

ip-90-147-166-23.ct1.garrservices.it, ip-90-147-167-237.ct1.garrservices.it, dm-192-168-100-13.garr.d4science.org, ip-90-147-167-234.ct1.garrservices.it

"dataminer_cloud1" cluster doesn't exist. Now we have only dataminer.garr cluster.

The containers above are stopped. Where have you seen these requests? Could you proof them?

Actions #13

Updated by Andrea Dell'Amico about 7 years ago

  • Status changed from New to In Progress

I understood what happened.

When a GARR VM restarts, its hostname is changed by the DHCP server. When we executed the upgrade jobs, the scopes were requested with the wrong hostname, because we did not run the tasks that change the hostname to the one we want.

I just fixed the script that requests the scopes so that it always requests the tokens for the correct hostname. I'm also going to run the tasks that fix the hostname (the playbook tag is set_hostname, FYI).

Actions #14

Updated by Gianpaolo Coro about 7 years ago

Great! As for the cluster with the "ghost" DMs, I meant the "dataminer_cloud1" group specified here http://dataminer-lb.garr.d4science.org:8880/

Actions #15

Updated by Andrea Dell'Amico about 7 years ago

  • Status changed from In Progress to Feedback
  • % Done changed from 0 to 100

I just requested new scopes for all the GARR dataminers; Tomcat has been restarted on all hosts (on some of them it is still starting):

192.168.100.13             : ok=18   changed=6    unreachable=0    failed=0
192.168.100.14             : ok=18   changed=5    unreachable=0    failed=0
192.168.100.19             : ok=18   changed=5    unreachable=0    failed=0
192.168.100.3              : ok=18   changed=5    unreachable=0    failed=0
192.168.100.6              : ok=18   changed=5    unreachable=0    failed=0
192.168.100.7              : ok=18   changed=5    unreachable=0    failed=0
192.168.100.8              : ok=18   changed=5    unreachable=0    failed=0
192.168.100.9              : ok=18   changed=5    unreachable=0    failed=0
ip-90-147-166-23.ct1.garrservices.it : ok=18   changed=5    unreachable=0    failed=0
ip-90-147-166-24.ct1.garrservices.it : ok=18   changed=5    unreachable=0    failed=0
ip-90-147-167-173.ct1.garrservices.it : ok=18   changed=6    unreachable=0    failed=0
ip-90-147-167-175.ct1.garrservices.it : ok=18   changed=6    unreachable=0    failed=0
ip-90-147-167-176.ct1.garrservices.it : ok=18   changed=5    unreachable=0    failed=0
ip-90-147-167-177.ct1.garrservices.it : ok=18   changed=6    unreachable=0    failed=0
ip-90-147-167-178.ct1.garrservices.it : ok=18   changed=5    unreachable=0    failed=0
ip-90-147-167-179.ct1.garrservices.it : ok=18   changed=6    unreachable=0    failed=0
ip-90-147-167-180.ct1.garrservices.it : ok=18   changed=5    unreachable=0    failed=0
ip-90-147-167-181.ct1.garrservices.it : ok=18   changed=5    unreachable=0    failed=0
ip-90-147-167-182.ct1.garrservices.it : ok=18   changed=6    unreachable=0    failed=0
ip-90-147-167-183.ct1.garrservices.it : ok=18   changed=6    unreachable=0    failed=0
ip-90-147-167-222.ct1.garrservices.it : ok=18   changed=5    unreachable=0    failed=0
ip-90-147-167-230.ct1.garrservices.it : ok=18   changed=5    unreachable=0    failed=0
ip-90-147-167-234.ct1.garrservices.it : ok=18   changed=5    unreachable=0    failed=0
ip-90-147-167-236.ct1.garrservices.it : ok=18   changed=5    unreachable=0    failed=0
ip-90-147-167-237.ct1.garrservices.it : ok=18   changed=6    unreachable=0    failed=0

Gianpaolo Coro wrote:

Great! As for the cluster with the "ghost" DMs, I meant the "dataminer_cloud1" group specified here http://dataminer-lb.garr.d4science.org:8880/

@roberto.cirillo@isti.cnr.it, do you remember that I reinstated that cluster when the GARR instances were stopped for maintenance a couple of weeks ago, so that if another emergency happens you can switch the service endpoint and we can still have the service working with the CNR dataminers only?

Actions #16

Updated by Gianpaolo Coro about 7 years ago

  • Status changed from Feedback to Closed

The DMs are working.
