Task #12449
closed
Add Nagios checks on couchbase cluster
100%
Description
It seems that the new buckets in the Couchbase cluster are not monitored by Nagios.
This is my mistake, because I never told you about them.
The old bucket, which is going to be deleted (see #12446), has the following checks:
- accounting_service OPS
- accounting_service VB total items
- accounting_service disk creates per second
- accounting_service items count
- accounting_service used memory
Related issues
Updated by Luca Frosini almost 7 years ago
- Related to Task #12446: Remove accounting_service bucket added
Updated by Andrea Dell'Amico almost 7 years ago
Can you list the buckets that need monitoring?
Updated by Luca Frosini almost 7 years ago
The buckets to be monitored are:
accounting_storage_status
AccountingManager
JobUsageRecord
ServiceUsageRecord
StorageUsageRecord
Updated by Tommaso Piccioli almost 7 years ago
- Status changed from New to In Progress
- % Done changed from 0 to 80
New Nagios checks have been added on the selected buckets, but we have to customize the parameters with @luca.frosini@isti.cnr.it
Updated by Luca Frosini almost 7 years ago
I read the documentation of the couchbase nagios plugin at:
https://gcube.wiki.gcube-system.org/gcube/Monitoring_a_gCube_infrastructure_With_Nagios#Couchbase_plugin
which is more or less the documentation provided by the plugin.
I really don't know how to tune the metrics. Maybe we are not so interested in monitoring bucket metrics; instead, we are interested in monitoring the health of the cluster.
Looking at the alerts received tonight, they are just useless, and they could cause the important ones to be overlooked.
@tommaso.piccioli@isti.cnr.it @andrea.dellamico@isti.cnr.it @pasquale.pagano@isti.cnr.it what do you think?
Updated by Andrea Dell'Amico almost 7 years ago
We do not collect metrics in Nagios. The checks are failing because the service is so slow to answer that the timeout is triggered, and that seems independent of the specific check: they all fail.
I've checked the plugin options and there's no way to specify a longer timeout without changing the code. It uses the Python requests library, so it should be easy.
There also are a lot of parameters that we do not use, so I don't know if we are checking the most significant aspects of the cluster.
(I didn't know that wiki page existed. FYI, most of the information it reports is obsolete.)
Updated by Andrea Dell'Amico almost 7 years ago
- Status changed from In Progress to Feedback
- % Done changed from 80 to 100
I just configured a timeout in the couchbase check code:
r = requests.get(url, auth=(options.username, options.password), timeout=(10, 120))
If it works, we should create a proper fix and send a pull request to the author.
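For reference, a proper fix could expose the timeout as plugin options instead of hardcoding it. The following is only a minimal sketch, not the plugin's actual code: the option names `--connect-timeout` and `--read-timeout`, the helpers `build_parser`, `timeout_from` and `fetch`, and the use of argparse are all assumptions made for illustration. The one thing taken from requests itself is that a `timeout` tuple is read as (connect timeout, read timeout), in seconds.

```python
import argparse


def build_parser():
    """Build a parser with hypothetical timeout options (names are assumptions)."""
    parser = argparse.ArgumentParser(description="Couchbase check (sketch)")
    parser.add_argument("--username", default="")
    parser.add_argument("--password", default="")
    # Defaults mirror the values hardcoded in the quick fix: 10 s and 120 s.
    parser.add_argument("--connect-timeout", type=float, default=10.0,
                        help="seconds to wait for the TCP connection")
    parser.add_argument("--read-timeout", type=float, default=120.0,
                        help="seconds to wait for the server to send data")
    return parser


def timeout_from(options):
    """Return the (connect, read) tuple that requests accepts as `timeout`."""
    return (options.connect_timeout, options.read_timeout)


def fetch(url, options):
    """Query the given URL with the configured credentials and timeouts."""
    # requests is third-party; it is imported here so that the option
    # handling above can be exercised without it installed.
    import requests
    return requests.get(url,
                        auth=(options.username, options.password),
                        timeout=timeout_from(options))
```

With something like this, the Nagios command definition could pass a longer read timeout for a slow cluster without touching the code again.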
Updated by Andrea Dell'Amico almost 7 years ago
- Status changed from Feedback to Closed
The change worked, it seems. I'm closing the ticket.