Incident #10306


mongo3-p-d4s.d4science.org went out of memory more than once

Added by Andrea Dell'Amico almost 8 years ago. Updated almost 8 years ago.

Status:
Closed
Priority:
Urgent
Assignee:
_InfraScience Systems Engineer
Category:
System Application
Target version:
Start date:
Nov 10, 2017
Due date:
% Done:

100%

Estimated time:
Infrastructure:
Production

Description

I see that it was OOM-killed, and restarted after that. There are Nagios alerts about the sync lag that follow the restarts.
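A quick way to confirm the OOM kill is to grep the kernel log on mongo3; this is a hedged sketch, and the sample log line below is fabricated to mimic the kernel's OOM-killer message (on the real host you would run the grep against `/var/log/kern.log` or `journalctl -k` instead of the sample variable):

```shell
# Sketch: confirm an OOM kill from the kernel log. The sample line below is
# invented to mimic the kernel's OOM-killer message; on the real host, grep
# /var/log/kern.log or the output of `journalctl -k` instead.
sample='Nov 10 04:12:01 mongo3 kernel: Out of memory: Kill process 1234 (mongod) score 900'
printf '%s\n' "$sample" | grep -c 'Out of memory: Kill process'   # prints 1
```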


Related issues

Related to D4Science Infrastructure - Task #10279: Large file upload limit on the workspace (Closed, Costantino Perciante, Nov 09, 2017)

Actions #1

Updated by Andrea Dell'Amico almost 8 years ago

  • Related to Task #10279: Large file upload limit on the workspace added
Actions #2

Updated by Andrea Dell'Amico almost 8 years ago

Could it be related to #10279?

Actions #3

Updated by Andrea Dell'Amico almost 8 years ago

  • Priority changed from Normal to Urgent
Actions #4

Updated by Roberto Cirillo almost 8 years ago

  • Status changed from New to In Progress
Actions #5

Updated by Roberto Cirillo almost 8 years ago

  • Status changed from In Progress to Feedback
  • Assignee changed from Roberto Cirillo to _InfraScience Systems Engineer

It cannot be the cause of issue #10279, since this is a secondary member and writes go to the primary. But of course the high workload of these days could be the cause of the high lag on that node. The high lag occurred only on this node, and not on mongo4 and mongo5 (both secondary nodes) as I would have expected. This may be because this node's disk storage is slower than that of the other two secondaries. Could that be the case?
Anyway, to prevent this kind of problem we should increase the hardware resources of the cluster nodes. We currently have 3 GB of RAM and 4 CPUs; I think we should increase the RAM to 4 GB on every node.
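For reference, the lag can be read per member from the replica set itself; `rs.printSlaveReplicationInfo()` in the mongo shell (the MongoDB 3.x-era name of the helper) reports each secondary's delay behind the primary. The sketch below only illustrates the arithmetic behind that number, with invented epoch values:

```shell
# Sketch: replication lag is the time of the primary's newest oplog entry
# minus the time of the last entry the secondary has applied.
# `rs.printSlaveReplicationInfo()` reports this per member; the epoch values
# below are invented sample numbers, just to show the computation.
primary_optime=1510286400     # hypothetical epoch seconds on the primary
secondary_optime=1510286340   # hypothetical epoch seconds applied on mongo3
echo $((primary_optime - secondary_optime))   # prints 60 (seconds behind)
```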

Actions #6

Updated by Andrea Dell'Amico almost 8 years ago

Roberto Cirillo wrote:

It cannot be the cause of issue #10279, since this is a secondary member and writes go to the primary. But of course the high workload of these days could be the cause of the high lag on that node. The high lag occurred only on this node, and not on mongo4 and mongo5 (both secondary nodes) as I would have expected. This may be because this node's disk storage is slower than that of the other two secondaries. Could that be the case?

The lags followed the crash and the restart, so I linked them to the fact that it restarted. I did not check if the mongo server also crashed on mongo4 and mongo5.

Anyway, to prevent this kind of problem we should increase the hardware resources of the cluster nodes. We currently have 3 GB of RAM and 4 CPUs; I think we should increase the RAM to 4 GB on every node.

It's doable. Both Tommaso and I will be back in the office on Thursday, but we can manage it anyway. Not today, but maybe tomorrow morning?

Actions #7

Updated by Roberto Cirillo almost 8 years ago

Andrea Dell'Amico wrote:

Roberto Cirillo wrote:

It cannot be the cause of issue #10279, since this is a secondary member and writes go to the primary. But of course the high workload of these days could be the cause of the high lag on that node. The high lag occurred only on this node, and not on mongo4 and mongo5 (both secondary nodes) as I would have expected. This may be because this node's disk storage is slower than that of the other two secondaries. Could that be the case?

The lags followed the crash and the restart, so I linked them to the fact that it restarted. I did not check if the mongo server also crashed on mongo4 and mongo5.

Anyway, to prevent this kind of problem we should increase the hardware resources of the cluster nodes. We currently have 3 GB of RAM and 4 CPUs; I think we should increase the RAM to 4 GB on every node.

It's doable. Both Tommaso and I will be back in the office on Thursday, but we can manage it anyway. Not today, but maybe tomorrow morning?

It's OK for me. If only a VM restart is needed, you can do it at any time; if it takes longer, it's better to do this operation after 6:00 PM, when the workload decreases.

Actions #8

Updated by Andrea Dell'Amico almost 8 years ago

Roberto Cirillo wrote:

It's OK for me. If only a VM restart is needed, you can do it at any time; if it takes longer, it's better to do this operation after 6:00 PM, when the workload decreases.

Only a restart is needed.

Actions #9

Updated by Tommaso Piccioli almost 8 years ago

I'm going to restart mongo3 with 4 GB of RAM right now.

Actions #10

Updated by Tommaso Piccioli almost 8 years ago

  • Status changed from Feedback to In Progress
  • % Done changed from 0 to 20

Done for mongo3-p-d4s

Actions #11

Updated by Roberto Cirillo almost 8 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 20 to 100

Thank you @tommaso.piccioli@isti.cnr.it, I'm going to close this ticket.

Actions #12

Updated by Andrea Dell'Amico almost 8 years ago

  • Status changed from Closed to In Progress
  • % Done changed from 100 to 20

I'm reopening it. Let's increase the RAM amount on all the other nodes so that they are identical.

Actions #13

Updated by Roberto Cirillo almost 8 years ago

I think the problem is on the mongo3 node. This node has the same configuration and the same workload as mongo2, but the load on mongo3 is very high while the load on mongo2 is very low.
I think this depends on the physical machine where mongo3 is running. If that is the case, we should move mongo3 to another machine.

Actions #14

Updated by Tommaso Piccioli almost 8 years ago

Last hour's news: there are connections to mongod on mongo3-p-d4s.d4science.org from the social-isti portal.

portal-si.isti.cnr.it shows established connections on port 27017 to all social-isti mongod nodes except mongoR4-si.isti.cnr.it; instead, there are connections to mongo3-p-d4s.
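That per-peer check can be sketched as a filter over `ss -tn` output; the sample line and IP addresses below are invented stand-ins, and on portal-si you would pipe the real `ss -tn` output through the same awk:

```shell
# Sketch: list remote peers of established TCP connections on the mongod port
# (27017). The sample line and IPs are invented stand-ins for real `ss -tn`
# output on portal-si.isti.cnr.it; in that format the fifth whitespace-
# separated field of each row is the peer address:port.
sample='ESTAB 0 0 146.48.122.1:51234 146.48.122.2:27017'
printf '%s\n' "$sample" | awk '$5 ~ /:27017$/ {print $5}'   # prints 146.48.122.2:27017
```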

Actions #15

Updated by Tommaso Piccioli almost 8 years ago

Tommaso Piccioli wrote:

portal-si.isti.cnr.it shows established connections on port 27017 to all social-isti mongod nodes except mongoR4-si.isti.cnr.it; instead, there are connections to mongo3-p-d4s.

Could someone fix this and restart the service to stop these connections?
It is the jackrabbit Tomcat app on portal-si.isti.cnr.it.

Actions #17

Updated by Roberto Cirillo almost 8 years ago

Tommaso Piccioli wrote:

Tommaso Piccioli wrote:

portal-si.isti.cnr.it shows established connections on port 27017 to all social-isti mongod nodes except mongoR4-si.isti.cnr.it; instead, there are connections to mongo3-p-d4s.

Could someone fix this and restart the service to stop these connections?
It is the jackrabbit Tomcat app on portal-si.isti.cnr.it.

I've restarted the service. @tommaso.piccioli@isti.cnr.it, could you please check whether there are other connections from portal-si to mongo3?

Actions #18

Updated by Tommaso Piccioli almost 8 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 20 to 100

No more connections from portal-si to mongo3.

mongo2-p-d4s and mongo4-p-d4s restarted with 4 GB of RAM.
