Incident #6575

Infra Gateway-1 node Garbage Collection (GC) process out of memory

Added by Massimiliano Assante over 8 years ago. Updated over 8 years ago.

Status: Closed
Priority: Urgent
Category: -
Target version:
Start date: Jan 24, 2017
Due date: Jan 25, 2017
% Done: 100%
Estimated time:
Infrastructure: Production

Description

infra-gateway1 went out of memory today around 16:00 CET; the stack trace is attached. At a glance it is unclear what caused the problem, though many exceptions are reported from Couchbase.

Alessandro, can you check?


Files

catalina.out.zip (861 KB) Massimiliano Assante, Jan 24, 2017 06:47 PM
Actions #1

Updated by Massimiliano Assante over 8 years ago

from Nagios:

Service: Memory status
Host: infra-gateway1.d4science.org
Address: infra-gateway1.d4science.org
State: WARNING

Date/Time: Tue Jan 24 15:28:59 CET 2017

Additional Info:

CHECK_MEMORY WARNING - 0G free
Notification Type: PROBLEM

Service: Haproxy backends
Host: infra-lb.d4science.org
Address: infra-lb.d4science.org
State: CRITICAL

Date/Time: Tue Jan 24 16:01:59 CET 2017

Additional Info:

Check haproxy CRITICAL - server: infra_backend:infra2 is DOWN 1/3 (check status: layer 7 (HTTP/SMTP) timeout):
Actions #2

Updated by Massimiliano Assante over 8 years ago

  • Status changed from New to In Progress
Actions #3

Updated by Alessandro Pieve over 8 years ago

I checked the log file and found about 20 accounting exceptions.
Considering that this "out of memory" issue does not occur on other nodes with thousands of exceptions, I am quite confident that the number of exceptions did not cause the problem.
In any case, those errors were related to requests I made to check the arrival of data from the registry and the collector, while the cluster was also executing aggregation.
In the file I also noticed other errors and problems with disconnections yesterday, so perhaps there were other issues.
However, I can increase the request timeout limit of the Couchbase library.
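
As a minimal sketch of that change, assuming the accounting library sits on the Couchbase Java SDK 2.x (the host name, bucket name, and timeout values below are illustrative placeholders, not the production configuration):

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

public class AccountingClientConfig {
    public static Bucket openAccountingBucket() {
        // Raise the SDK timeouts (in milliseconds) so that slow responses during
        // cluster-side aggregation surface as latency instead of timeout exceptions.
        CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
                .connectTimeout(15000)   // SDK 2.x default: 5000 ms
                .kvTimeout(10000)        // SDK 2.x default: 2500 ms
                .queryTimeout(120000)    // SDK 2.x default: 75000 ms
                .build();

        Cluster cluster = CouchbaseCluster.create(env, "couchbase.example.org");
        return cluster.openBucket("accounting"); // placeholder bucket name
    }
}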

Actions #4

Updated by Pasquale Pagano over 8 years ago

  • % Done changed from 0 to 20

This incident can hardly be fixed with the current information. It was important to look at the logs, but if you don't identify a trace to follow, please comment and close this issue.

Actions #5

Updated by Massimiliano Assante over 8 years ago

We clearly have a leak, and we must identify what happened yesterday; otherwise the issue will happen again, and we cannot close the ticket.

Actions #6

Updated by Pasquale Pagano over 8 years ago

Yes, I agree, but keeping the ticket open does not solve the issue. The ticket is assigned to Alessandro, and I understood from his post that the issue is not related, or is very unlikely to be related, to the accounting. So, what next? Who has to perform additional investigations?

Actions #7

Updated by Massimiliano Assante over 8 years ago

  • Assignee changed from Alessandro Pieve to Massimiliano Assante
Actions #8

Updated by Massimiliano Assante over 8 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 20 to 50

By looking at the logs I can only confirm that between 14:00 and 16:00 CET the only exceptions were due to the accounting; nothing else weird is in the logs. Both cluster nodes were restarted by me last Saturday to fix a problem with email notification, so the leak indeed occurred between Sunday and Tuesday afternoon.

Unfortunately, the portal cluster nodes' memory occupation was not monitored, as they were not on Ganglia (due to an error) nor on Munin (they were never added). Tickets have been opened to fix the lack of monitoring.

Nothing else can be done until the leak appears again.
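
Not part of the ticket actions, just a hedged note on catching the leak the next time it shows up: on a HotSpot JVM a heap dump can be taken automatically at the moment of failure (the -XX:+HeapDumpOnOutOfMemoryError startup flag), or programmatically through the HotSpotDiagnostic MXBean once heap usage crosses a threshold. The threshold, dump path, and class name in this sketch are illustrative assumptions:

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumpOnLowMemory {

    // Writes a heap dump to filePath; "live" restricts the dump to reachable objects.
    static void dumpHeap(String filePath, boolean live) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(filePath, live);
    }

    public static void main(String[] args) throws Exception {
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long maxMb = rt.maxMemory() / (1024 * 1024);
        // Dump once heap usage exceeds 90% of the configured maximum (illustrative threshold).
        if (usedMb > maxMb * 0.9) {
            dumpHeap("/tmp/infra-gateway1-heap.hprof", true);
        }
    }
}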
