Incident #10910
The accounting fallback logs are killing a lot of services
Status: Closed (100% done)
Description
Here is a probably incomplete list (not all servers are monitored by Nagios):
workspace-repository-prod1.d4science.org
dataminer0-proto.d4science.org
dataminer1-proto.d4science.org
dataminer2-proto.d4science.org
dataminer3-proto.d4science.org
dataminer4-proto.d4science.org
dataminer5-proto.d4science.org
dataminer1-p-d4s.d4science.org
dataminer2-p-d4s.d4science.org
dataminer3-p-d4s.d4science.org
geoserver-protectedareaimpactmaps.d4science.org
geoserver1-protectedareaimpactmaps.d4science.org
geoserver2-protectedareaimpactmaps.d4science.org
thredds.d4science.org
The non-working workspace is causing a lot of dataminer jobs to fail.
Files
Related issues
Updated by Andrea Dell'Amico over 7 years ago
- Related to Incident #10895: accounting bloat on workspace-repository-prod1.d4science.org - AGAIN added
Updated by Andrea Dell'Amico over 7 years ago
- Related to Incident #10701: accounting bloat on workspace-repository-prod1.d4science.org added
Updated by Andrea Dell'Amico over 7 years ago
- Related to Incident #10651: dataminer3-p-d4s.d4science.org filled the disk added
Updated by Andrea Dell'Amico over 7 years ago
- Blocks Incident #10909: Regular Failure of Dataminer "Garr" (2 out of 3 execution attempts) - Internal Server Error added
Updated by Andrea Dell'Amico over 7 years ago
- Status changed from New to In Progress
Updated by Andrea Dell'Amico over 7 years ago
- Status changed from In Progress to Closed
- % Done changed from 0 to 100
I stopped, cleaned up, and restarted the services on all the above hosts. Things seem back to normal; a fallback accounting file has appeared on the workspace, but that one seems under control for now:
$ ls -l SmartGears/state/
total 72
-rw-r--r-- 1 gcube gcube 9375 Jan 13 15:24 _d4science.research-infrastructures.eu_SmartArea_SmartCamera.fallback.log
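Checking for oversized fallback logs before the disk fills up can be done with a quick one-liner like the sketch below. The state directory path and the 100 MB threshold are my assumptions for illustration, not values taken from the ticket:

```shell
# List fallback accounting logs above a size threshold.
# STATE_DIR and the +100M threshold are assumed values, not from the ticket.
STATE_DIR="${STATE_DIR:-$HOME/SmartGears/state}"
find "$STATE_DIR" -name '*.fallback.log' -size +100M -print 2>/dev/null || true
```

Running this from cron (or a Nagios check) on the affected hosts would give early warning instead of discovering the problem via a full disk.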
Updated by Andrea Dell'Amico over 7 years ago
- Status changed from Closed to In Progress
- % Done changed from 100 to 90
The disk on workspace-repository-prod1.d4science.org was full again. I cleaned it up and restarted the services.
Updated by Andrea Dell'Amico over 7 years ago
- File fix-accounting-crap.yml fix-accounting-crap.yml added
It happened again, and I guess it will happen again in the next few hours. I'm attaching a playbook that cleans up the involved hosts. It has to be run from inside d4science-ghn-cluster this way:

./run.sh fix-accounting-crap.yml -i inventory/hosts.production

Anyone who only has access as the gcube user can change the remote_user directive to remote_user: gcube and comment out the become and become_user occurrences.
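For reference, a playbook of this kind might look like the minimal sketch below. The hosts group, service name, and state path are assumptions on my part; the actual fix-accounting-crap.yml attached to this ticket is the authoritative version:

```yaml
# Hypothetical sketch of the cleanup playbook; the hosts group, service
# name, and paths are assumed, not taken from the attached file.
- hosts: accounting_affected_hosts
  remote_user: root        # gcube-only users: set remote_user: gcube and
  become: true             # comment out the become/become_user lines
  become_user: gcube
  tasks:
    - name: Stop the SmartGears container
      service:
        name: tomcat-instance    # assumed service name
        state: stopped

    - name: Remove the fallback accounting logs
      shell: rm -f /home/gcube/SmartGears/state/*.fallback.log

    - name: Start the SmartGears container again
      service:
        name: tomcat-instance
        state: started
```

Stopping the container before deleting the logs matters: removing a file that the service still holds open frees no disk space until the process is restarted.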
Updated by Roberto Cirillo over 7 years ago
- Status changed from In Progress to Closed
- % Done changed from 90 to 100
I've upgraded the accounting libraries on "workspace-repository-prod1" as suggested by @luca.frosini@isti.cnr.it :
accounting-lib-3.2.0-4.10.0-162088.jar
document-store-lib-2.2.0-4.10.0-162084.jar