Incident #6575
Infra Gateway-1 node Garbage Collection (GC) process out of memory
Status: Closed
% Done: 100%
Description
infra-gateway1 ran out of memory today around 16:00 CET; the stack trace is attached. At a glance it is unclear what caused the problem, though many exceptions are reported from Couchbase.
Alessandro, can you check?
Files
Updated by Massimiliano Assante over 8 years ago
from Nagios:
Service: Memory status
Host: infra-gateway1.d4science.org
Address: infra-gateway1.d4science.org
State: WARNING
Date/Time: Tue Jan 24 15:28:59 CET 2017
Additional Info: CHECK_MEMORY WARNING - 0G free

Notification Type: PROBLEM
Service: Haproxy backends
Host: infra-lb.d4science.org
Address: infra-lb.d4science.org
State: CRITICAL
Date/Time: Tue Jan 24 16:01:59 CET 2017
Additional Info: Check haproxy CRITICAL - server: infra_backend:infra2 is DOWN 1/3 (check status: layer 7 (HTTP/SMTP) timeout)
Updated by Massimiliano Assante over 8 years ago
- Status changed from New to In Progress
Updated by Alessandro Pieve over 8 years ago
I checked the log file and I found about 20 accounting exceptions.
Considering that this "out of memory" issue does not occur on other nodes that report thousands of exceptions, I am quite confident that the number of exceptions did not cause the issue.
In any case, those errors were related to requests I made to check the arrival of data from the registry and the collector, while the cluster was also executing an aggregation.
In the log file I also noticed other errors and disconnection problems from yesterday, so perhaps there have been other issues.
However, I can increase the request timeout limit of the Couchbase library.
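As a minimal sketch, assuming the accounting component uses the Couchbase Java SDK 2.x and builds its own environment (the class names are real SDK ones; the timeout values and host name are illustrative assumptions), the timeouts could be raised where the cluster connection is created:

import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

// Raise the SDK timeouts so that slow responses during aggregation do not
// immediately surface as request timeout exceptions (values are assumptions).
CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
        .connectTimeout(10000)   // ms, SDK default is 5000
        .kvTimeout(10000)        // ms, SDK default is 2500
        .queryTimeout(120000)    // ms, SDK default is 75000
        .build();

// "accounting-couchbase-host" is a placeholder for the actual cluster address.
CouchbaseCluster cluster = CouchbaseCluster.create(env, "accounting-couchbase-host");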
Updated by Pasquale Pagano over 8 years ago
- % Done changed from 0 to 20
This incident can hardly be fixed with the current information. It was important to look at the logs, but if you don't identify a trace to follow, please comment and close this issue.
Updated by Massimiliano Assante over 8 years ago
We clearly have a leak, and we must identify what happened yesterday, otherwise the issue will happen again; we cannot close the ticket.
Updated by Pasquale Pagano over 8 years ago
Yes, I agree, but keeping the ticket open does not solve the issue. The ticket is assigned to Alessandro, and I understood from his post that the issue is not related, or only marginally related, to the accounting. So, what next? Who has to perform additional investigations?
Updated by Massimiliano Assante over 8 years ago
- Assignee changed from Alessandro Pieve to Massimiliano Assante
Updated by Massimiliano Assante over 8 years ago
- Status changed from In Progress to Closed
- % Done changed from 20 to 50
By looking at the logs I can only confirm that between 14:00 and 16:00 CET the only exceptions were due to the accounting; nothing else in the logs looks unusual. I restarted both cluster nodes last Saturday to fix a problem with email notifications, so the leak must have occurred between Sunday and Tuesday afternoon.
Unfortunately, the memory occupation of the portal cluster nodes was not monitored, as they were not on Ganglia (due to an error) nor on Munin (they were never added). Tickets have been opened to fix this lack of monitoring.
Nothing else can be done until the leak appears again.
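In the meantime, a low-cost way to make the next occurrence diagnosable (a sketch, assuming the gateway portal runs on a standard HotSpot JVM under Tomcat; the dump path is an assumption) is to have the JVM write a heap dump when the OutOfMemoryError is thrown:

# Append to the Tomcat startup options (e.g. setenv.sh); the path is illustrative.
export CATALINA_OPTS="$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/tomcat/infra-gateway1-oom.hprof"

The resulting .hprof file can then be inspected offline (e.g. with Eclipse MAT) to see which objects retain the heap.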