Task #12449
closed
Add Nagios checks on couchbase cluster
Added by Luca Frosini almost 7 years ago.
Updated almost 7 years ago.
Assignee:
_InfraScience Systems Engineer
Infrastructure:
Production
Description
It seems that the new buckets in the Couchbase cluster are not monitored by Nagios.
This is my mistake: I never told you about them.
The old bucket, which is going to be deleted (see #12446), has the following checks:
- accounting_service OPS
- accounting_service VB total items
- accounting_service disk creates per second
- accounting_service items count
- accounting_service used memory
- Related to Task #12446: Remove accounting_service bucket added
Can you list the buckets that need monitoring?
The buckets to be monitored are:
accounting_storage_status
AccountingManager
JobUsageRecord
ServiceUsageRecord
StorageUsageRecord
- Status changed from New to In Progress
- % Done changed from 0 to 80
New Nagios checks have been added on the selected buckets, but we have to customize the parameters with @luca.frosini@isti.cnr.it
I read the documentation of the couchbase nagios plugin at:
https://gcube.wiki.gcube-system.org/gcube/Monitoring_a_gCube_infrastructure_With_Nagios#Couchbase_plugin
which is more or less the documentation provided by the plugin.
I really don't know how to tune the metrics. Maybe we are not so interested in monitoring bucket metrics; instead, we are interested in monitoring the cluster sanity.
Looking at the alerts received tonight, they are just useless, and they could cause the important ones to be discarded.
@tommaso.piccioli@isti.cnr.it @andrea.dellamico@isti.cnr.it @pasquale.pagano@isti.cnr.it what do you think?
We do not collect metrics in Nagios. The checks are failing because the service is so slow to answer that the timeout is triggered, and that seems independent of the specific check: they all fail.
I've checked the plugin options and there's no way to specify a longer timeout without changing the code. It uses the Python requests library, so it should be easy.
There also are a lot of parameters that we do not use, so I don't know if we are checking the most significant aspects of the cluster.
(FYI: I didn't know that wiki page existed; most of the information it reports is obsolete.)
- Status changed from In Progress to Feedback
- % Done changed from 80 to 100
I just configured a timeout in the couchbase check code:
r = requests.get(url, auth=(options.username, options.password),timeout=(10,120))
If it works we should create a proper fix and send a pull request to the author
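A proper fix could expose the two timeouts as command-line options instead of hardcoding them. The following is only a minimal sketch of that idea, not the plugin's actual code: the option names --connect-timeout and --read-timeout, the helper names parse_args and fetch, and the injectable http_get argument are all assumptions made for illustration. The defaults repeat the values from the line above. requests interprets a two-element timeout tuple as (connect timeout, read timeout), both in seconds.

```python
import argparse


def parse_args(argv=None):
    """Parse the options used by this sketch (names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Couchbase check (sketch)")
    parser.add_argument("--url", required=True)
    parser.add_argument("--username", required=True)
    parser.add_argument("--password", required=True)
    # Assumed options: the plugin discussed here has no timeout flags.
    parser.add_argument("--connect-timeout", type=float, default=10.0,
                        help="seconds to wait for the TCP connection")
    parser.add_argument("--read-timeout", type=float, default=120.0,
                        help="seconds to wait for the server to send data")
    return parser.parse_args(argv)


def fetch(options, http_get=None):
    """Issue the GET request with a (connect, read) timeout tuple."""
    if http_get is None:
        # Imported lazily so the sketch can be exercised without requests.
        import requests
        http_get = requests.get
    timeout = (options.connect_timeout, options.read_timeout)
    return http_get(options.url,
                    auth=(options.username, options.password),
                    timeout=timeout)
```

With something like this, a slow cluster could be accommodated from the Nagios command definition (for example by passing --read-timeout 180) without editing the plugin again.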
- Status changed from Feedback to Closed
The change worked, it seems. I'm closing the ticket.