Incident #6575
closed
Infra Gateway-1 node Garbage Collection (GC) process out of memory
Added by Massimiliano Assante over 8 years ago.
Updated over 8 years ago.
Infrastructure:
Production
Description
infra-gateway1 went out of memory today around 16:00 CET; the stack trace is attached. At a glance it is unclear what caused the problem, though many exceptions are reported from Couchbase.
Alessandro, can you check?
Files
from Nagios:
Service: Memory status
Host: infra-gateway1.d4science.org
Address: infra-gateway1.d4science.org
State: WARNING
Date/Time: Tue Jan 24 15:28:59 CET 2017
Additional Info:
CHECK_MEMORY WARNING - 0G free
Notification Type: PROBLEM
Service: Haproxy backends
Host: infra-lb.d4science.org
Address: infra-lb.d4science.org
State: CRITICAL
Date/Time: Tue Jan 24 16:01:59 CET 2017
Additional Info:
Check haproxy CRITICAL - server: infra_backend:infra2 is DOWN 1/3 (check status: layer 7 (HTTP/SMTP) timeout):
- Status changed from New to In Progress
I checked the log file and I found about 20 accounting exceptions.
Considering that this "out of memory" issue does not occur on other nodes with thousands of exceptions, I am quite confident that the number of exceptions did not cause the issue.
In any case, those errors were related to requests I made to check the arrival of the data from the registry and the collector, while the cluster was also executing aggregation.
In the log file I also noticed other errors and disconnection problems from yesterday; perhaps there were other issues.
However, I can increase the request timeout of the Couchbase client library.
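For reference, a minimal sketch of how the request timeouts could be raised, assuming the Couchbase Java SDK 2.x is the client in use (the host and bucket names below are placeholders, not the actual accounting configuration):

    import com.couchbase.client.java.Bucket;
    import com.couchbase.client.java.Cluster;
    import com.couchbase.client.java.CouchbaseCluster;
    import com.couchbase.client.java.env.CouchbaseEnvironment;
    import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

    public class CouchbaseTimeoutSketch {
        public static void main(String[] args) {
            // Build an environment with longer timeouts (values in milliseconds).
            CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
                    .connectTimeout(10000)   // SDK 2.x default is 5000 ms
                    .kvTimeout(10000)        // SDK 2.x default is 2500 ms for key/value operations
                    .queryTimeout(150000)    // SDK 2.x default is 75000 ms for N1QL queries
                    .build();

            // Open the cluster and bucket with the customized environment.
            Cluster cluster = CouchbaseCluster.create(env, "couchbase-host.example.org"); // placeholder host
            Bucket bucket = cluster.openBucket("accounting"); // placeholder bucket name

            // ... perform accounting reads/writes here ...

            cluster.disconnect();
            env.shutdown();
        }
    }

Raising the timeouts would only hide slow responses from the cluster; it would not address a memory leak on the gateway node.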
- % Done changed from 0 to 20
This incident can hardly be fixed with the current information. It was important to look at the logs, but if you don't identify a trace to follow, please comment and close this issue.
We clearly have a leak, and we must identify what happened yesterday; otherwise the issue will happen again and we cannot close the ticket.
Yes, I agree, but keeping the ticket open does not solve the issue. The ticket is assigned to Alessandro, and I understood from his post that the issue is not related, or only marginally related, to the accounting. So, what next? Who has to perform additional investigations?
- Assignee changed from Alessandro Pieve to Massimiliano Assante
- Status changed from In Progress to Closed
- % Done changed from 20 to 50
By looking at the logs I can only confirm that between 14:00 and 16:00 CET the only exceptions were due to the accounting; nothing else weird appears in the logs. Both cluster nodes were restarted by me last Saturday to fix a problem with email notifications, so the leak indeed occurred between Sunday and Tuesday afternoon.
Unfortunately, the memory occupation of the portal cluster nodes was not monitored, as they were not on Ganglia (due to an error) nor on Munin (they were never added). Tickets have been opened to fix this monitoring gap.
Nothing else can be done until the leak appears again.