Task #12449

closed

Add Nagios checks on couchbase cluster

Added by Luca Frosini almost 7 years ago. Updated almost 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
_InfraScience Systems Engineer
Category:
Application
Target version:
Start date:
Sep 10, 2018
Due date:
% Done:

100%

Estimated time:
Infrastructure:
Production

Description

It seems that the new buckets in the Couchbase cluster are not monitored by Nagios.

This is my mistake, because I never told you about them.

The old bucket, which is going to be deleted (see #12446), has the following checks:

  • accounting_service OPS
  • accounting_service VB total items
  • accounting_service disk creates per second
  • accounting_service items count
  • accounting_service used memory

Related issues

Related to D4Science Infrastructure - Task #12446: Remove accounting_service bucket | Closed | Luca Frosini | Sep 10, 2018

Actions #1

Updated by Luca Frosini almost 7 years ago

  • Related to Task #12446: Remove accounting_service bucket added
Actions #2

Updated by Andrea Dell'Amico almost 7 years ago

Can you list the buckets that need monitoring?

Actions #3

Updated by Luca Frosini almost 7 years ago

The buckets to be monitored are:
accounting_storage_status
AccountingManager
JobUsageRecord
ServiceUsageRecord
StorageUsageRecord

Actions #4

Updated by Tommaso Piccioli almost 7 years ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 80

New Nagios checks have been added on the selected buckets, but we have to customize the parameters with @luca.frosini@isti.cnr.it

Actions #5

Updated by Luca Frosini almost 7 years ago

I read the documentation of the Couchbase Nagios plugin at:
https://gcube.wiki.gcube-system.org/gcube/Monitoring_a_gCube_infrastructure_With_Nagios#Couchbase_plugin

which is more or less the documentation provided by the plugin.

I really don't know how to tune the metrics. Maybe we are less interested in monitoring bucket metrics and more interested in monitoring the health of the cluster.

Looking at the alerts received tonight, they are just useless, and they could cause the important ones to be overlooked.

@tommaso.piccioli@isti.cnr.it @andrea.dellamico@isti.cnr.it @pasquale.pagano@isti.cnr.it what do you think?
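
To make the "cluster health instead of bucket metrics" idea concrete, here is a minimal sketch of how such a check could decide its result. It assumes the check has already fetched and parsed the JSON returned by the Couchbase REST endpoint /pools/default, and that each node entry carries the "hostname", "status" and "clusterMembership" fields. The function name evaluate_cluster and the choice of which condition maps to WARNING or CRITICAL are illustrative, not something taken from the existing plugin.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_cluster(pool):
    """Map a parsed /pools/default response to (exit_code, message).

    Assumption: every entry in pool["nodes"] has "hostname", "status"
    and "clusterMembership" keys. The severity mapping is a suggestion.
    """
    nodes = pool.get("nodes", [])
    if not nodes:
        return UNKNOWN, "UNKNOWN - no nodes reported"

    # Nodes whose status is anything other than "healthy".
    unhealthy = [n.get("hostname", "?") for n in nodes
                 if n.get("status") != "healthy"]
    # Nodes that are not active members of the cluster.
    inactive = [n.get("hostname", "?") for n in nodes
                if n.get("clusterMembership") != "active"]

    if unhealthy:
        return CRITICAL, "CRITICAL - unhealthy nodes: " + ", ".join(unhealthy)
    if inactive:
        return WARNING, "WARNING - nodes not active: " + ", ".join(inactive)
    return OK, "OK - %d nodes healthy and active" % len(nodes)
```

A wrapper script would print the message and exit with the returned code, so Nagios raises one alert per cluster problem rather than one per bucket metric.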

Actions #6

Updated by Andrea Dell'Amico almost 7 years ago

We do not collect metrics in Nagios. The checks are failing because the service is so slow to answer that the timeout is triggered, and that seems independent of the specific check: they all fail.
I've checked the plugin options and there's no way to specify a longer timeout without changing the code. It uses the Python requests library, so it should be easy.

There are also a lot of parameters that we do not use, so I don't know if we are checking the most significant aspects of the cluster.

(I didn't know about the existence of that wiki page, most of the information reported is obsolete, FYI)

Actions #7

Updated by Andrea Dell'Amico almost 7 years ago

  • Status changed from In Progress to Feedback
  • % Done changed from 80 to 100

I just configured a timeout in the couchbase check code:

r = requests.get(url, auth=(options.username, options.password), timeout=(10, 120))

If it works, we should create a proper fix and send a pull request to the author.
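
As a possible shape for that proper fix, the sketch below exposes the two timeout values as command-line options instead of hard-coding them. In requests, a timeout tuple is read as (connect timeout, read timeout) in seconds, so (10, 120) allows 10 seconds to connect and 120 seconds to wait for the response. The option names --connect-timeout and --read-timeout, the --username and --password stand-ins, and the helper names are all hypothetical. The real plugin's option parser may look different; optparse is only assumed here because the code reads options.username.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser with hypothetical timeout options."""
    parser = OptionParser()
    # Stand-ins for the credentials options the plugin already has.
    parser.add_option("--username", dest="username", default=None)
    parser.add_option("--password", dest="password", default=None)
    # Defaults mirror the values hard-coded in the quick fix above.
    parser.add_option("--connect-timeout", dest="connect_timeout",
                      type="float", default=10.0,
                      help="seconds to wait for the connection")
    parser.add_option("--read-timeout", dest="read_timeout",
                      type="float", default=120.0,
                      help="seconds to wait for the response")
    return parser


def timeout_from_options(options):
    """Return the (connect, read) tuple that requests accepts as timeout."""
    return (options.connect_timeout, options.read_timeout)


def fetch(url, options):
    """Perform the HTTP call with the configured timeouts."""
    # Imported here so option parsing can be exercised without requests.
    import requests
    return requests.get(url,
                        auth=(options.username, options.password),
                        timeout=timeout_from_options(options))
```

With this in place, a slow cluster could be handled from the Nagios command definition, for example by passing --read-timeout 300, without touching the plugin code again.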

Actions #8

Updated by Andrea Dell'Amico almost 7 years ago

  • Status changed from Feedback to Closed

The change worked, it seems. I'm closing the ticket.
