Task #11257
closed
DataMiners not responding
100%
Description
There are 9 non-responding production DataMiners (I report the list with the corresponding getCapabilities):
ip-90-147-167-175.ct1.garrservices.it -> http://ip-90-147-167-175.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-177.ct1.garrservices.it -> http://ip-90-147-167-177.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-179.ct1.garrservices.it -> http://ip-90-147-167-179.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-182.ct1.garrservices.it -> http://ip-90-147-167-182.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-183.ct1.garrservices.it -> http://ip-90-147-167-183.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-166-23.ct1.garrservices.it -> http://ip-90-147-166-23.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-234.ct1.garrservices.it -> http://ip-90-147-167-234.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-237.ct1.garrservices.it -> http://ip-90-147-167-237.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
dm-192-168-100-13.garr.d4science.org -> http://dm-192-168-100-13.garr.d4science.org/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
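For reference, a minimal sketch of how these endpoints can be probed in bulk. This is only an illustration, not the tool actually used here: it assumes Python with the requests library, the token placeholder must be replaced with a valid gcube-token, and the host list is abridged from the one above.

    import requests

    GCUBE_TOKEN = "REPLACE-WITH-A-VALID-GCUBE-TOKEN"  # placeholder, not a real token
    HOSTS = [
        "ip-90-147-167-175.ct1.garrservices.it",
        "ip-90-147-167-177.ct1.garrservices.it",
        "dm-192-168-100-13.garr.d4science.org",
        # ...remaining hosts from the list above
    ]

    for host in HOSTS:
        url = (f"http://{host}/wps/WebProcessingService"
               f"?Request=GetCapabilities&Service=WPS&gcube-token={GCUBE_TOKEN}")
        try:
            resp = requests.get(url, timeout=15)
            # A healthy DataMiner answers HTTP 200 with a WPS Capabilities XML document.
            if resp.status_code == 200 and b"Capabilities" in resp.content:
                print(f"{host}: OK")
            else:
                print(f"{host}: NOT RESPONDING (HTTP {resp.status_code})")
        except requests.RequestException as exc:
            print(f"{host}: NOT RESPONDING ({exc})")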
Could you please help me understand what's happening?
Updated by Roberto Cirillo about 7 years ago
I've checked the following host: dm-192-168-100-13.garr.d4science.org
The container did not restart correctly during the last restart I performed this morning.
The ghn.log shows the following:
2018-02-21 10:05:56,627 [localhost-startStop-1] WARN ContainerManager: the token 9f6d573d-c881-4dfc-ab12-4bec4e07fa7a-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container
2018-02-21 10:05:56,660 [localhost-startStop-1] WARN ContainerManager: the token 7de2d922-89c6-4145-8b90-52011e048379-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container
2018-02-21 10:05:56,688 [localhost-startStop-1] WARN ContainerManager: the token b5754690-675f-4d1e-93de-dbd2f802140a-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container
2018-02-21 10:05:56,715 [localhost-startStop-1] WARN ContainerManager: the token 58fbc1d8-95c5-4308-a955-5a85893b28d9-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container
2018-02-21 10:05:56,742 [localhost-startStop-1] WARN ContainerManager: the token 255b8c63-6c26-4584-99d0-5476586780db-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container
2018-02-21 10:05:56,769 [localhost-startStop-1] WARN ContainerManager: the token 7edf64cd-52b1-41f3-ae56-d91cb13ed51d-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container
2018-02-21 10:05:56,800 [localhost-startStop-1] WARN ContainerManager: the token 8767fcb5-464d-4205-8d6e-752e67f83d1d-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container
2018-02-21 10:05:56,827 [localhost-startStop-1] WARN ContainerManager: the token 8064225f-ef5a-4e41-9017-cf92722e3b2b-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container
2018-02-21 10:05:56,854 [localhost-startStop-1] WARN ContainerManager: the token d29fdd99-0e88-4f13-ae33-f52258e9f578-843339462 cannot be used, the client id dataminer-2:80 resolved with the token is not the same of the one specified in this container
2018-02-21 10:05:56,854 [localhost-startStop-1] ERROR ContainerManager: no valid starting token are specified, moving the container to failed
2018-02-21 10:05:56,856 [localhost-startStop-1] ERROR ContainerManager: cannot manage container (see cause)
java.lang.RuntimeException: no valid starting token are specified
    at org.gcube.smartgears.managers.ContainerManager.validateContainer(ContainerManager.java:129)
    at org.gcube.smartgears.managers.ContainerManager.start(ContainerManager.java:71)
    at org.gcube.smartgears.Bootstrap.startContainerIfItHasntAlreadyFailed(Bootstrap.java:124)
    at org.gcube.smartgears.Bootstrap.<init>(Bootstrap.java:45)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at java.lang.Class.newInstance(Class.java:442)
    at org.apache.catalina.startup.WebappServiceLoader.loadServices(WebappServiceLoader.java:188)
    at org.apache.catalina.startup.WebappServiceLoader.load(WebappServiceLoader.java:152)
    at org.apache.catalina.startup.ContextConfig.processServletContainerInitializers(ContextConfig.java:1543)
    at org.apache.catalina.startup.ContextConfig.webConfig(ContextConfig.java:1265)
    at org.apache.catalina.startup.ContextConfig.configureStart(ContextConfig.java:873)
    at org.apache.catalina.startup.ContextConfig.lifecycleEvent(ContextConfig.java:371)
    at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
    at org.apache.catalina.util.LifecycleBase.fireLifecycleEvent(LifecycleBase.java:90)
    at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5392)
    at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
    at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901)
    at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877)
    at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:632)
    at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:1073)
    at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1857)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2018-02-21 10:06:46,485 [localhost-startStop-1] ERROR ApplicationManager: error starting application data-transfer-service
java.lang.RuntimeException: cannot create profile for data-transfer-service
    at org.gcube.smartgears.handlers.application.lifecycle.ProfileManager.create(ProfileManager.java:238)
    at org.gcube.smartgears.handlers.application.lifecycle.ProfileManager.loadOrCreateProfile(ProfileManager.java:219)
    at org.gcube.smartgears.handlers.application.lifecycle.ProfileManager.activated(ProfileManager.java:89)
    at org.gcube.smartgears.handlers.application.lifecycle.ProfileManager.onStart(ProfileManager.java:72)
    at org.gcube.smartgears.handlers.application.ApplicationLifecycleHandler.onEvent(ApplicationLifecycleHandler.java:42)
    at org.gcube.smartgears.handlers.application.ApplicationLifecycleHandler.onEvent(ApplicationLifecycleHandler.java:18)
    at org.gcube.smartgears.handlers.Pipeline.forward(Pipeline.java:65)
    at org.gcube.smartgears.managers.ApplicationManager.start(ApplicationManager.java:273)
    at org.gcube.smartgears.managers.ApplicationManager.start(ApplicationManager.java:120)
    at org.gcube.smartgears.Bootstrap.onStartup(Bootstrap.java:61)
    at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5493)
    at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
    at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901)
    at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877)
    at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:632)
    at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:1073)
    at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1857)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: unknown property ghn-profile
    at org.gcube.smartgears.context.Properties.lookup(Properties.java:102)
It seems that the tokens contained in container.xml were generated for the host dataminer-2, which is different from the hostname declared in the container.xml file:
<hostname>dm-192-168-100-13.garr.d4science.org</hostname>
This morning I only restarted the container; nothing changed on container.xml:
gcube@dataminer-2:~$ ls -als SmartGears/container.xml
4 -r--r----- 1 gcube gcube 1656 Feb 12 14:23 Smart
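As a side note, a minimal sketch (a hypothetical helper, not part of SmartGears) that surfaces this kind of mismatch by comparing /etc/hostname with the <hostname> element in SmartGears/container.xml; the file locations are assumed:

    import xml.etree.ElementTree as ET
    from pathlib import Path

    # Assumed locations: SmartGears lives in the gcube user's home directory.
    container_xml = Path.home() / "SmartGears" / "container.xml"
    etc_hostname = Path("/etc/hostname").read_text().strip()

    # <hostname> is the element quoted above; findtext returns "" if it is missing.
    declared = ET.parse(container_xml).getroot().findtext(".//hostname", default="").strip()

    if declared != etc_hostname:
        print(f"MISMATCH: container.xml says '{declared}', /etc/hostname says '{etc_hostname}'")
    else:
        print(f"OK: both report '{declared}'")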
Updated by Andrea Dell'Amico about 7 years ago
I've seen this behaviour in the past, on other VMs. I had to request the scopes again.
What's also worrying is that Tomcat responds in a way that the balancer accepts as 'working fine'.
Updated by Gianpaolo Coro about 7 years ago
This reminds me of an old GHN issue where the local state of the service randomly got corrupted... could it be related, @lucio.lelii@isti.cnr.it?
Updated by Roberto Cirillo about 7 years ago
Andrea Dell'Amico wrote:
I've seen this behaviour in the past, on other VMs. I had to request the scopes again.
I could try to re-run the playbook in order to restore the right tokens, but in /etc/hostname I see the following hostname: dataminer-2
Maybe we need to change the /etc/hostname file before re-running the playbook. @andrea.dellamico@isti.cnr.it, what do you suggest?
Updated by Lucio Lelii about 7 years ago
@gianpaolo.coro@isti.cnr.it no, the problem is related to token generation: the hostname in the container token is dataminer-2, without the domain.
Updated by Roberto Cirillo about 7 years ago
For the moment, I'm going to stop the containers where this problem occurs. This way, new computation requests will not be routed to these VMs.
Updated by Roberto Cirillo about 7 years ago
All the affected VMs have been stopped.
Updated by Gianpaolo Coro about 7 years ago
The non-working DMs are now 10 (one more), all GARR machines:
ip-90-147-167-175.ct1.garrservices.it -> http://ip-90-147-167-175.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-177.ct1.garrservices.it -> http://ip-90-147-167-177.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-179.ct1.garrservices.it -> http://ip-90-147-167-179.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-182.ct1.garrservices.it -> http://ip-90-147-167-182.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-183.ct1.garrservices.it -> http://ip-90-147-167-183.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-166-24.ct1.garrservices.it -> http://ip-90-147-166-24.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-166-23.ct1.garrservices.it -> http://ip-90-147-166-23.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-234.ct1.garrservices.it -> http://ip-90-147-167-234.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
ip-90-147-167-237.ct1.garrservices.it -> http://ip-90-147-167-237.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
dm-192-168-100-13.garr.d4science.org -> http://dm-192-168-100-13.garr.d4science.org/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
Updated by Roberto Cirillo about 7 years ago
In the last list there is just one more VM than the previous list: ip-90-147-166-24.ct1.garrservices.it
On this VM the hostname is correct and the GetCapabilities request works well now: http://ip-90-147-166-24.ct1.garrservices.it/wps/WebProcessingService?Request=GetCapabilities&Service=WPS&gcube-token=48f869e2-b924-4b72-9541-272ddd3aeafb-843339462
So this VM is working fine for me.
BTW, from the Ansible inventory we have 25 GARR VMs, so at this moment 16 VMs are running properly.
Updated by Pasquale Pagano about 7 years ago
- Priority changed from Normal to High
Updated by Gianpaolo Coro about 7 years ago
Just to alert you that the shut-down GHNs behind the "dataminer_cloud1" production cluster are still receiving requests; indeed, they appear as up on the HAProxy interface. The machines are:
ip-90-147-166-23.ct1.garrservices.it, ip-90-147-167-237.ct1.garrservices.it, dm-192-168-100-13.garr.d4science.org, ip-90-147-167-234.ct1.garrservices.it
Updated by Roberto Cirillo about 7 years ago
Gianpaolo Coro wrote:
Just to alert you that the shut-down GHNs behind the "dataminer_cloud1" production cluster are still receiving requests; indeed, they appear as up on the HAProxy interface. The machines are:
ip-90-147-166-23.ct1.garrservices.it, ip-90-147-167-237.ct1.garrservices.it, dm-192-168-100-13.garr.d4science.org, ip-90-147-167-234.ct1.garrservices.it
"dataminer_cloud1" cluster doesn't exist. Now we have only dataminer.garr cluster.
The containers above are stopped. Where have you seen these requests? Could you proof them?
Updated by Andrea Dell'Amico about 7 years ago
- Status changed from New to In Progress
I understood what happened.
When a GARR VM restarts, the hostname is changed by the DHCP server. When we executed the upgrade jobs, the scopes were requested with the wrong hostname because we did not run the tasks that set the hostname to the one we want.
I just fixed the script that requests the scopes so that it always requests the tokens for the correct hostname. I'm also going to run the tasks that fix the hostname (the playbook tag is set_hostname, FYI).
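For illustration only, a minimal sketch of the kind of guard described above (the actual fix lives in the provisioning scripts and playbook, which are not shown here): refuse to request new tokens when the machine's hostname does not match the expected FQDN.

    import socket
    import sys

    def ensure_expected_hostname(expected_fqdn: str) -> None:
        # Refuse to request new scopes/tokens if DHCP has reset the hostname.
        actual = socket.getfqdn()
        if actual != expected_fqdn:
            sys.exit(f"hostname is '{actual}', expected '{expected_fqdn}': "
                     "run the set_hostname tasks before requesting tokens")

    # The expected FQDN would come from the Ansible inventory; it is hard-coded here
    # only as an example.
    ensure_expected_hostname("dm-192-168-100-13.garr.d4science.org")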
Updated by Gianpaolo Coro about 7 years ago
Great! As for the cluster with the "ghost" DMs, I meant the "dataminer_cloud1" group specified here http://dataminer-lb.garr.d4science.org:8880/
Updated by Andrea Dell'Amico about 7 years ago
- Status changed from In Progress to Feedback
- % Done changed from 0 to 100
I just requested new scopes for all the GARR dataminers; Tomcat has been restarted on all hosts (on some of them it is still starting):
192.168.100.13                        : ok=18  changed=6  unreachable=0  failed=0
192.168.100.14                        : ok=18  changed=5  unreachable=0  failed=0
192.168.100.19                        : ok=18  changed=5  unreachable=0  failed=0
192.168.100.3                         : ok=18  changed=5  unreachable=0  failed=0
192.168.100.6                         : ok=18  changed=5  unreachable=0  failed=0
192.168.100.7                         : ok=18  changed=5  unreachable=0  failed=0
192.168.100.8                         : ok=18  changed=5  unreachable=0  failed=0
192.168.100.9                         : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-166-23.ct1.garrservices.it  : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-166-24.ct1.garrservices.it  : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-167-173.ct1.garrservices.it : ok=18  changed=6  unreachable=0  failed=0
ip-90-147-167-175.ct1.garrservices.it : ok=18  changed=6  unreachable=0  failed=0
ip-90-147-167-176.ct1.garrservices.it : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-167-177.ct1.garrservices.it : ok=18  changed=6  unreachable=0  failed=0
ip-90-147-167-178.ct1.garrservices.it : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-167-179.ct1.garrservices.it : ok=18  changed=6  unreachable=0  failed=0
ip-90-147-167-180.ct1.garrservices.it : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-167-181.ct1.garrservices.it : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-167-182.ct1.garrservices.it : ok=18  changed=6  unreachable=0  failed=0
ip-90-147-167-183.ct1.garrservices.it : ok=18  changed=6  unreachable=0  failed=0
ip-90-147-167-222.ct1.garrservices.it : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-167-230.ct1.garrservices.it : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-167-234.ct1.garrservices.it : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-167-236.ct1.garrservices.it : ok=18  changed=5  unreachable=0  failed=0
ip-90-147-167-237.ct1.garrservices.it : ok=18  changed=6  unreachable=0  failed=0
Gianpaolo Coro wrote:
Great! As for the cluster with the "ghost" DMs, I meant the "dataminer_cloud1" group specified here http://dataminer-lb.garr.d4science.org:8880/
@roberto.cirillo@isti.cnr.it, do you remember that I reinstated that cluster when the GARR instances were stopped for maintenance a couple of weeks ago, so that if another emergency happens you can switch the service endpoint and we can still have the service working with the CNR dataminers only?
Updated by Gianpaolo Coro about 7 years ago
- Status changed from Feedback to Closed
The DMs are working.