Incident #6179: Infrastructure Gateway cluster communication not working properly - D4Science Infrastructure - D4science

Incident #6179

closed

Infrastructure Gateway cluster communication not working properly

Added by Massimiliano Assante over 9 years ago. Updated over 9 years ago.

Status:

Closed

Priority:

High

Assignee:

Andrea Dell'Amico

Category:

Other

Target version:

UnSprintable

Start date:

Dec 12, 2016

Due date:

Dec 14, 2016

% Done:

100%

Estimated time:

Infrastructure:

Pre-Production, Production

Description

Communication within the Infrastructure Gateway cluster is not working properly: for example, when a user is added on one node, the other nodes in the cluster are no longer informed. This happens in both pre-production and production.

Updated by Massimiliano Assante over 9 years ago

Status changed from New to In Progress

To limit the problems until we fix this, we must run on a single node in production. I'm going to shut down infra-gateway1 and leave only infra-gateway up.

Updated by Massimiliano Assante over 9 years ago

infra-gateway1 is down now until we understand where the problem is.

life@infra-gateway1:~$ ./stopContainer.sh 
liferay stop/waiting

Updated by Andrea Dell'Amico over 9 years ago

I set some additional Liferay properties (from here):
https://docs.liferay.com/portal/6.2/propertiesdoc/portal.properties.html

portal.instance.protocol=
portal.instance.http.port=
portal.instance.https.port=

on both preprod1 and preprod2. I also slightly changed the Tomcat cluster configuration. The result seems to be the same: the login session is not moved from one server to the other after a Tomcat restart. I don't know how to test any other behaviour. However, I do see the Tomcat cluster communications in the logs, so the Tomcat cluster part seems to be working.
I've also seen this one: http://stackoverflow.com/questions/20468692/how-to-stop-overriding-portal-ext-file. The next question is: is there a way to know whether any of the portlets are breaking the cluster configuration?

Updated by Andrea Dell'Amico over 9 years ago

An interesting thread here: https://web.liferay.com/community/forums/-/message_boards/message/35359683
Our Tomcat configuration seems correct, but the Liferay application is not marked as 'distributable'.

Updated by Massimiliano Assante over 9 years ago

% Done changed from 0 to 30

After performing several tests this morning, we haven't found the exact cause yet.

It could be related to the firewall present on both nodes, but the behaviour is non-deterministic:

When the firewall is up, the communication sometimes occurs and sometimes doesn't (mostly it doesn't). When the firewall is down, the communication has occurred every time (so far).
We need more tests to understand where the problem is.

Updated by Massimiliano Assante over 9 years ago

When I disable the firewall on both preprod1 and preprod2, I can see this in the log of preprod2 (but not in preprod1):

Dec 14, 2016 10:06:05 AM org.apache.catalina.ha.tcp.SimpleTcpCluster memberAdded
INFO: Replication member added:org.apache.catalina.tribes.membership.MemberImpl[tcp://{146, 48, 122, 240}:4000,{146, 48, 122, 240},4000, alive=57072890, securePort=-1, UDP Port=-1, id={-59 -67 -25 -88 -9 83 72 19 -112 -122 99 -65 -42 49 -46 126 }, payload={}, command={}, domain={}, ]
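
For comparing the two nodes' logs, the member address in a line like the one above can be pulled out with a few lines of Python. This is only a sketch: the function name `parse_member` is made up for illustration, and the pattern assumes the address is always printed as unsigned octets in the `tcp://{a, b, c, d}:port` form seen in this log.

```python
import re

# Matches the "tcp://{a, b, c, d}:port" part of a Tribes MemberImpl log entry.
# Assumption: the octets are printed unsigned, as in the log line above.
MEMBER_RE = re.compile(r"tcp://\{(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\}:(\d+)")


def parse_member(log_line):
    """Return (ip, port) for the first member address in the line, or None."""
    match = MEMBER_RE.search(log_line)
    if match is None:
        return None
    *octets, port = match.groups()
    return ".".join(octets), int(port)


line = (
    "INFO: Replication member added:org.apache.catalina.tribes.membership."
    "MemberImpl[tcp://{146, 48, 122, 240}:4000,{146, 48, 122, 240},4000, "
    "alive=57072890, securePort=-1, UDP Port=-1]"
)
print(parse_member(line))  # ('146.48.122.240', 4000)
```

Running the same extraction over the catalina logs of both nodes would show which node registered which member, and when.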

Updated by Andrea Dell'Amico over 9 years ago

Hm. This should be the normal behaviour even with the firewall enabled, and I did see the cluster nodes talking to each other while testing the other day.

I see two possibilities:

We continue the testing, gathering the traffic and analyzing it;
We completely open the IP traffic between the tomcat nodes.

The second is OK with me.

Updated by Massimiliano Assante over 9 years ago

% Done changed from 30 to 70

Here is the result of my tests this morning (for the Liferay cluster only).

The tests consisted of one user on preprod1 asking to register to a VRE, and the manager user accepting the request on preprod2, repeated for 5 different users. The results show that the firewall sometimes interferes.

With Firewall UP

1st attempt worked as expected
2nd attempt worked as expected
3rd attempt did not work as expected: Join Request did not arrive on preprod2; Accepted Request on preprod2 (after manually clearing the Liferay cluster cache) not reflected on preprod1
4th attempt did not work as expected: Join Request arrived on preprod2; Accepted Request on preprod2 not reflected on preprod1
5th attempt did not work as expected: Join Request did not arrive on preprod2; Accepted Request on preprod2 (after manually clearing the Liferay cluster cache) not reflected on preprod1

With Firewall DOWN

1st attempt worked as expected
2nd attempt worked as expected
3rd attempt worked as expected
4th attempt worked as expected
5th attempt worked as expected

So, until we fully understand which ports and protocols Liferay uses, we should completely open the IP traffic between the Tomcat nodes; at least this seems to make the Liferay cluster work.
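
As a rough sanity check on the attempts listed above, the outcomes can be tallied and compared. The sketch below only restates the ten results from this comment. The helper names are hypothetical, and the hypergeometric tail is my own choice of check (a one-sided Fisher-style test), not something used in the original tests. With five attempts per condition the sample is small, so the number should be read as indicative only.

```python
from math import comb

# Outcomes copied from the attempts above: True = worked as expected.
results = {
    "firewall up": [True, True, False, False, False],
    "firewall down": [True, True, True, True, True],
}


def success_rate(outcomes):
    """Fraction of attempts that worked."""
    return sum(outcomes) / len(outcomes)


def prob_at_most(k, group_size, total_success, total):
    """Hypergeometric P(X <= k): chance that a group of `group_size` attempts,
    drawn from `total` attempts containing `total_success` successes,
    holds at most k successes purely by chance."""
    total_fail = total - total_success
    lowest = max(0, group_size - total_fail)
    ways = sum(
        comb(total_success, i) * comb(total_fail, group_size - i)
        for i in range(lowest, k + 1)
    )
    return ways / comb(total, group_size)


for name, outcomes in results.items():
    print(name, success_rate(outcomes))

# 7 successes in 10 attempts overall; the firewall-up group of 5 saw only 2.
p = prob_at_most(2, 5, 7, 10)
print(round(p, 4))
```

The chance of seeing a split at least this lopsided with no real firewall effect comes out at 1/12, which supports the suspicion but is not conclusive with so few attempts.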

Updated by Andrea Dell'Amico over 9 years ago

I just reviewed the firewall rules, and they permit all TCP traffic between the Tomcat nodes (the rule was added because of some portlet I no longer remember).

If you don't mind trying another test, we could switch to a full unicast cluster configuration. That will require some reconfiguration each time a new node is added (an entry for each Tomcat node is needed in the server.xml of every cluster node).
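
To give an idea of the per-node entries a unicast setup involves, here is a small Python sketch that prints the `<Member/>` elements for one node's server.xml. It is not our actual configuration. The host names are placeholders, port 4000 is taken from the receiver port seen in the logs, the `uniqueId` scheme is invented for the example, and I am assuming the usual layout where each node lists the other nodes inside a `StaticMembershipInterceptor`.

```python
from xml.sax.saxutils import quoteattr

# Hypothetical host names; replace with the real cluster nodes.
NODES = ["preprod1.example.org", "preprod2.example.org"]


def static_members(local_host, hosts, port=4000):
    """Build <Member/> entries for every cluster node except the local one."""
    entries = []
    for index, host in enumerate(hosts):
        if host == local_host:
            continue  # assumption: a node does not list itself
        # uniqueId must be 16 bytes; here the last byte is the node index.
        unique_id = "{" + ",".join(["0"] * 15 + [str(index)]) + "}"
        entries.append(
            '<Member className="org.apache.catalina.tribes.membership.StaticMember" '
            f'host={quoteattr(host)} port="{port}" uniqueId="{unique_id}"/>'
        )
    return entries


for node in NODES:
    print(f"<!-- entries for server.xml on {node} -->")
    for entry in static_members(node, NODES):
        print(entry)
```

Adding a node means extending `NODES` and regenerating the entries for every server, which is the reconfiguration cost mentioned above.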

I can also confirm that if you don't add the <distributable /> tag to Liferay's main web.xml, the user sessions cannot be migrated.
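
To see which deployed web applications carry the tag, something like the following Python sketch could be run against the Tomcat webapps directory. The function names are made up for illustration, and it assumes each application keeps its descriptor at `<app>/WEB-INF/web.xml` and that `<distributable/>` is a direct child of `<web-app>`.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def is_distributable(web_xml):
    """True if the descriptor (str or bytes) has a <distributable/> child."""
    root = ET.fromstring(web_xml)
    # Strip any XML namespace ("{uri}tag") before comparing tag names.
    return any(child.tag.rsplit("}", 1)[-1] == "distributable" for child in root)


def non_distributable_webapps(webapps_dir):
    """Names of webapps under webapps_dir whose web.xml lacks the tag."""
    missing = []
    for web_xml in sorted(Path(webapps_dir).glob("*/WEB-INF/web.xml")):
        # Read bytes so the parser honours the encoding declared in the file.
        if not is_distributable(web_xml.read_bytes()):
            missing.append(web_xml.parent.parent.name)
    return missing
```

Calling `non_distributable_webapps` with the path of the webapps directory would return the applications that still need the tag.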

#10

Updated by Massimiliano Assante over 9 years ago

Status changed from In Progress to Closed
% Done changed from 70 to 100

Trying to dig further into HTTP session replication, I carefully read the post you linked (https://web.liferay.com/community/forums/-/message_boards/message/35359683). The author there claims that every single portlet also needs the <distributable /> tag in its web.xml.

Since there's no time to do this now, or to perform more tests on it, I'm going to close this ticket, considering the Liferay cluster (with no firewall) an acceptable solution, although without HA.

I'm going to open a Feature ticket on gCube so that we can investigate and properly enable session replication in 2017.

#13

Updated by Massimiliano Assante over 9 years ago

Ran some tests in production and everything worked as expected.

#14

Updated by Andrea Dell'Amico over 9 years ago

Status changed from Closed to In Progress
Assignee changed from Massimiliano Assante to Andrea Dell'Amico
% Done changed from 100 to 50

The next move will be to test the preproduction environment after opening the firewall to all the multicast networks.

If it still refuses to work correctly, I'll need an hour of exclusive use of the portal to collect the network traffic while some clustered operations are ongoing.
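
Before a full traffic capture, a quick probe can show whether multicast datagrams reach a node at all. The Python sketch below joins a group and waits for one datagram. The group and port are Tomcat's documented membership defaults (228.0.0.4:45564), used here as an assumption: the values in our server.xml, and the groups used by Liferay's own cluster link, may well differ. The function names are hypothetical.

```python
import ipaddress
import socket
import struct

# Tomcat's documented default membership address and port (assumption: the
# real cluster configuration may use different values).
GROUP, PORT = "228.0.0.4", 45564


def membership_request(group, interface="0.0.0.0"):
    """Pack the ip_mreq structure expected by IP_ADD_MEMBERSHIP."""
    if not ipaddress.ip_address(group).is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(interface))


def listen(group=GROUP, port=PORT, timeout=10.0):
    """Join the group and return (sender_ip, byte_count) for the first
    datagram received, or None if nothing arrives before the timeout."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        sock.setsockopt(
            socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership_request(group)
        )
        sock.settimeout(timeout)
        try:
            data, (sender, _) = sock.recvfrom(65535)
        except socket.timeout:
            return None
        return sender, len(data)
    finally:
        sock.close()
```

Calling `listen()` on one node while the other is running, once with the firewall up and once with it down, would show whether the membership heartbeats get through in each case.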

#15

Updated by Massimiliano Assante over 9 years ago

Related to Task #6232: Enable Infrastructure Gateway cluster communication behind firewall added

#16

Updated by Massimiliano Assante over 9 years ago

Status changed from In Progress to Closed

OK, but let's close this incident and use this task to track the activity: https://support.d4science.org/issues/6232
