Incident #6179
Infrastructure Gateway cluster communication not working properly
Status: Closed
% Done: 100%
Description
The communication in the Infrastructure Gateway cluster is not working properly: for example, when a user is added on one node, the other nodes in the cluster are no longer informed. This happens in both preproduction and production.
Related issues
       Updated by Massimiliano Assante almost 9 years ago
      
    
    - Status changed from New to In Progress
To limit the problems until we fix this, we must run on a single node in production. I'm going to shut down infra-gateway1 and leave only infra-gateway up.
       Updated by Massimiliano Assante almost 9 years ago
      
    
    infra-gateway1 is down now until we understand where the problem is.
life@infra-gateway1:~$ ./stopContainer.sh
liferay stop/waiting
       Updated by Andrea Dell'Amico almost 9 years ago
      
    
    I set some other liferay properties (from here):
https://docs.liferay.com/portal/6.2/propertiesdoc/portal.properties.html
portal.instance.protocol=
portal.instance.http.port=
portal.instance.https.port=
on both preprod1 and preprod2. I also changed the tomcat cluster configuration a bit. The result seems to be the same: the login session is not moved from one server to the other after a tomcat restart. I don't know how to test any other behaviours, but I do see the tomcat cluster communications in the logs, so the tomcat cluster part seems to be working.
I've also seen this one: http://stackoverflow.com/questions/20468692/how-to-stop-overriding-portal-ext-file and the next question is: is there a way to know if any of the portlets are breaking the cluster configuration?
       Updated by Andrea Dell'Amico almost 9 years ago
      
    
    An interesting post here: https://web.liferay.com/community/forums/-/message_boards/message/35359683
Our tomcat configuration seems correct, while the liferay application is not marked as 'distributable'.
       Updated by Massimiliano Assante almost 9 years ago
      
    
    - % Done changed from 0 to 30
After performing several tests this morning we haven't found the exact cause yet.
It could be related to the firewall present on both nodes, but the behaviour is non-deterministic:
when the firewall is up, the communication sometimes occurs and sometimes doesn't (mostly it doesn't); when the firewall is down, the communication has occurred every time (so far).
We need more tests to understand where the problem is.
       Updated by Massimiliano Assante almost 9 years ago
      
    
    When I disable the firewall on both preprod1 and preprod2, I can see this in the log of preprod2 (but not in preprod1):
Dec 14, 2016 10:06:05 AM org.apache.catalina.ha.tcp.SimpleTcpCluster memberAdded
INFO: Replication member added:org.apache.catalina.tribes.membership.MemberImpl[tcp://{146, 48, 122, 240}:4000,{146, 48, 122, 240},4000, alive=57072890, securePort=-1, UDP Port=-1, id={-59 -67 -25 -88 -9 83 72 19 -112 -122 99 -65 -42 49 -46 126 }, payload={}, command={}, domain={}, ]
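To compare what each node logs with the firewall up and down, the member address can be pulled out of these "Replication member added" lines. The following is only a sketch, assuming the log format shown above; the function names are hypothetical and not part of tomcat or liferay.

```python
import re

# Matches the "tcp://{a, b, c, d}:port" part of a MemberImpl log entry,
# as it appears in the log excerpt above.
MEMBER_RE = re.compile(r"tcp://\{(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\}:(\d+)")


def parse_member(line):
    """Return (ip, port) for a member log line, or None if nothing matches."""
    match = MEMBER_RE.search(line)
    if match is None:
        return None
    *octets, port = match.groups()
    return ".".join(octets), int(port)


def members_added(log_lines):
    """Collect the address of every member announced as added in the given lines."""
    found = []
    for line in log_lines:
        if "Replication member added" in line:
            member = parse_member(line)
            if member is not None:
                found.append(member)
    return found
```

Running members_added() over the catalina log of each node, once with the firewall up and once with it down, would show which node fails to see the other.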
       Updated by Andrea Dell'Amico almost 9 years ago
      
    
    Hm. This should be the normal behaviour with the firewall enabled too, and I did see the cluster nodes talk to each other while testing the other day.
I see two possibilities:
- We continue the testing, gathering the traffic and analyzing it;
- We completely open the IP traffic between the tomcat nodes.
The second is OK with me.
       Updated by Massimiliano Assante almost 9 years ago
      
    
    - % Done changed from 30 to 70
Here are the results of my tests this morning (liferay cluster only).
Each test consisted of a user on preprod1 asking to register to a VRE and the manager user accepting the request on preprod2; this was repeated for 5 different users. The results show that the firewall sometimes interferes.
With Firewall UP
- 1st attempt worked as expected
- 2nd attempt worked as expected
- 3rd attempt did not work as expected: Join Request did not arrive on preprod2; Accepted Request on preprod2 (after manually clearing the liferay cluster cache) was not reflected on preprod1
- 4th attempt did not work as expected: Join Request arrived on preprod2; Accepted Request on preprod2 was not reflected on preprod1
- 5th attempt did not work as expected: Join Request did not arrive on preprod2; Accepted Request on preprod2 (after manually clearing the liferay cluster cache) was not reflected on preprod1
With Firewall DOWN
- 1st attempt worked as expected
- 2nd attempt worked as expected
- 3rd attempt worked as expected
- 4th attempt worked as expected
- 5th attempt worked as expected
So, until we fully understand which ports and protocols liferay uses, we should completely open the IP traffic between the tomcat nodes; at least this seems to make the liferay cluster work.
       Updated by Andrea Dell'Amico almost 9 years ago
      
    
    I just reviewed the firewall rules and they permit all the TCP traffic between the tomcat nodes (the rule was added because of some portlet I no longer remember).
If you don't mind trying another test, we could switch to a full unicast cluster configuration. That will need some reconfiguration each time a new node is added (an entry for each tomcat node is needed in the server.xml of every cluster node).
I can also confirm that if you don't add the <distributable /> tag to liferay's main web.xml, the user sessions cannot be migrated.
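A quick way to see whether a given web.xml carries the <distributable /> tag is to parse it and look for that element. This is a minimal sketch using only the Python standard library; the function name is hypothetical, and it assumes the tag sits directly under the root web-app element, as in a standard deployment descriptor.

```python
import xml.etree.ElementTree as ET


def is_distributable(web_xml_text):
    """Return True if the descriptor has a <distributable/> child under its root."""
    root = ET.fromstring(web_xml_text)
    for child in root:
        # Drop an XML namespace prefix such as "{http://java.sun.com/xml/ns/javaee}".
        local_name = child.tag.rsplit("}", 1)[-1]
        if local_name == "distributable":
            return True
    return False
```

The same check can be applied to the web.xml of any other deployed webapp by reading the file and passing its contents to is_distributable().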
       Updated by Massimiliano Assante almost 9 years ago
      
    
    - Status changed from In Progress to Closed
- % Done changed from 70 to 100
Trying to dig more into HTTP session replication, I carefully read the post you linked (https://web.liferay.com/community/forums/-/message_boards/message/35359683). The author there claims that each single portlet also needs to add the <distributable /> tag in its web.xml.
Since there's no time to do this now or to perform more tests, I'm going to close this ticket, considering the liferay cluster (with no firewall) an acceptable solution, although without HA.
I'm going to open a Feature ticket on gCube so that we can investigate and properly enable session replication in 2017.
       Updated by Massimiliano Assante almost 9 years ago
      
    
    Ran some tests in production and everything worked as expected.
       Updated by Andrea Dell'Amico almost 9 years ago
      
    
    - Status changed from Closed to In Progress
- Assignee changed from Massimiliano Assante to Andrea Dell'Amico
- % Done changed from 100 to 50
The next move will be to test the preproduction environment after opening the firewall to all the multicast networks.
If it still refuses to work correctly, I'll need an hour of exclusive usage of the portal to collect the network traffic while some clustered operations are ongoing.
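Before a full traffic capture, a small probe could show whether multicast datagrams get through the firewall between the two nodes at all. The sketch below is only an illustration: the function names are hypothetical, and the multicast group and port must be taken from the actual tomcat and liferay cluster configuration, which is not shown in this ticket.

```python
import socket


def membership_request(group, interface="0.0.0.0"):
    """Build the ip_mreq payload for IP_ADD_MEMBERSHIP: group address + local interface."""
    return socket.inet_aton(group) + socket.inet_aton(interface)


def listen(group, port, timeout=10.0):
    """Wait for one datagram sent to group:port; return the sender address or None."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        membership_request(group))
        sock.settimeout(timeout)
        try:
            _data, sender = sock.recvfrom(65535)
            return sender
        except socket.timeout:
            return None
    finally:
        sock.close()


def send(group, port, payload=b"probe", ttl=1):
    """Send one datagram to group:port; ttl=1 keeps it on the local network."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        sock.sendto(payload, (group, port))
    finally:
        sock.close()
```

For example, listen("228.0.0.4", 45564) on preprod2 while send("228.0.0.4", 45564) runs on preprod1, once with the firewall up and once with it down. 228.0.0.4:45564 is tomcat's default membership address and is used here only as a placeholder; if a cluster is already running on that group, listen() may simply return one of its heartbeats.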
       Updated by Massimiliano Assante almost 9 years ago
      
    
    - Related to Task #6232: Enable Infrastructure Gateway cluster communication behind firewall added
       Updated by Massimiliano Assante almost 9 years ago
      
    
    - Status changed from In Progress to Closed
OK, but let's close this incident and use this task to track the activity: https://support.d4science.org/issues/6232
 
  
  