Task #9551
Some jobs wrote big chunks of data into the persistence jackrabbit table
Status: open
% Done: 10%
Description
Between Aug 19th and 21st, something (or someone) repeatedly wrote big chunks of data into the persistence manager data field of the workspace.
We need to investigate and find the rogue process to avoid a similar event in the future.
Updated by Valentina Marioli over 7 years ago
- Status changed from New to In Progress
Updated by Gianpaolo Coro over 7 years ago
When a computation fails, a gCube Item is created with a stacktrace attached. In the case of BiOnym, this can be long, since it contains the logs produced by F. Fiorellato's (ex-FAO) libraries for taxonomic name matching. If the WS or another system was blocked for some reason, the ~3400 calls per day could have produced a large amount of stacktrace logs. Thus, the large number of logs and the occupied space could be an indicator that either the WS or other satellite services were not working. For example, if the URI-Resolver was unavailable for some reason, BiOnym would have failed all its computations (because it uses files accessed on the WS through the URI-Resolver).
Some questions need clarification:
1 - If either the WS or satellite services were not working, did we have Nagios alerts?
2 - Were there issues on the URI-Resolver in those days?
3 - Are the logs in the BiOnym stacktrace so huge to justify that occupation?
Updated by Andrea Dell'Amico over 7 years ago
I also uploaded a blob of something that seems to be an execution log, with failures: https://goo.gl/VNmpeJ
Updated by Andrea Dell'Amico over 7 years ago
Gianpaolo Coro wrote:
Some questions need clarification:
1 - If either the WS or satellite services were not working, did we have Nagios alerts?
Yes. The infra portals, workspace, mongo and the URI resolver are monitored. The URI resolver has also been under HA for some months now.
2 - Were there issues on the URI-Resolver in those days?
No
3 - Are the logs in the BiOnym stacktrace so huge to justify that occupation?
Those blobs do not contain only stacktraces. The second one I posted is a stack trace; the first one is full of the same strings repeated thousands of times.
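For reference, a minimal sketch of how one could check which strings dominate such a blob once it has been downloaded locally. The file name dump.bin is hypothetical, and the blob is assumed to be (mostly) line-oriented text; nothing here is specific to the Jackrabbit schema.

    # Count the most frequently repeated lines in a locally downloaded blob dump.
    # "dump.bin" is a hypothetical local copy of the blob posted above.
    from collections import Counter

    counts = Counter()
    with open("dump.bin", "rb") as f:
        for raw in f:
            # Decode leniently, since the blob may mix text and binary fragments.
            counts[raw.decode("utf-8", errors="replace").strip()] += 1

    # Show the ten most repeated lines and how often each occurs.
    for line, n in counts.most_common(10):
        print(f"{n:>10}  {line[:120]}")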
Updated by Costantino Perciante over 7 years ago
If we really want to understand how the content is stored in there, then we will need to inspect the Jackrabbit code. I think it would be quite an expensive operation.
Andrea, I think we should restore (somewhere else, of course) a dump of that database (we can discuss the dates; something from before 22-23/08 would be useful). This will allow us to compare some rows of the pm_default_bundle table in terms of size and content (and presence as well).
Also, we figured out that Postgres (luckily) compresses data, so octet_length on a bytea field just returns the size of the compressed data (the biggest row needs 23MB in Postgres, but after decompression it takes up what Andrea reported, i.e. more than 400 MB).
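As a rough illustration of the comparison described above, here is a minimal sketch that lists the largest rows of the pm_default_bundle table with both their stored size (pg_column_size) and the raw length of the bytea value (octet_length). The connection parameters are placeholders, and the node_id / bundle_data column names are assumptions based on Jackrabbit's default bundle schema.

    # Sketch: list the biggest rows of the persistence manager table, comparing the
    # stored size with the raw bytea length. Connection parameters are placeholders;
    # the node_id / bundle_data column names are assumptions.
    import psycopg2

    conn = psycopg2.connect(host="localhost", dbname="jackrabbit",
                            user="postgres", password="secret")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT node_id,
                   pg_column_size(bundle_data) AS stored_bytes,
                   octet_length(bundle_data)   AS value_bytes
            FROM pm_default_bundle
            ORDER BY pg_column_size(bundle_data) DESC
            LIMIT 20
        """)
        for node_id, stored_bytes, value_bytes in cur.fetchall():
            print(node_id, stored_bytes, value_bytes)
    conn.close()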
Updated by Andrea Dell'Amico over 7 years ago
@valentina.marioli@isti.cnr.it just asked to restore those backups. You can find them on workspace-repository-prod1.d4science.org under /data. There's a backup from August 19th and one from Aug 16th.
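For the record, a minimal sketch of restoring one of those dumps into a scratch database for the comparison, assuming the dumps are in pg_dump custom format and authentication is already configured (e.g. via .pgpass); the dump file name and the scratch database name are hypothetical.

    # Sketch: restore a dump into a scratch database so rows can be compared with
    # the live pm_default_bundle table. The dump file name and the scratch database
    # name are hypothetical placeholders.
    import subprocess

    dump_path = "/data/backup_2017-08-19.dump"  # hypothetical file name under /data
    scratch_db = "jackrabbit_restore"

    subprocess.run(["createdb", "-h", "localhost", "-U", "postgres", scratch_db],
                   check=True)
    subprocess.run(["pg_restore", "-h", "localhost", "-U", "postgres",
                    "-d", scratch_db, dump_path],
                   check=True)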
Updated by Gianpaolo Coro over 7 years ago
One of the big blobs contains a large quantity of objects generated by the good old Statistical Manager between 2014 and Nov. 2016. These files were written in the ".Application" hidden folder, which is empty for the statistical.manager user. Since this logic was part of the StatMan service (not of the algorithms execution engine), it is not present in any way in the DataMiner. Those are old objects that came from somewhere. Valentina is now checking whether they were also present in the Derby DB. Furthermore, there is no StatMan service currently connected to the infrastructure.
Updated by Costantino Perciante over 7 years ago
Gianpaolo Coro wrote:
One of the big blobs contains a large quantity of objects generated by the good old Statistical Manager between 2014 and Nov. 2016. [...]
I guess you are referring to the largest blobs we found. As for the other one (i.e. the one containing a stacktrace), I think it would be better to handle that situation in another way (e.g., send it via mail/message) instead of adding the result as a property. I hope it can be done.
Updated by Massimiliano Assante over 7 years ago
- Assignee changed from Valentina Marioli to Costantino Perciante
Updated by Pasquale Pagano over 7 years ago
- Tracker changed from Incident to Task
Updated by Andrea Dell'Amico over 7 years ago
Some days ago, the DB size increased by almost 10GB in 48 hours. It has been stable again since then, but those increments seem unpredictable and could be dangerous.
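Since spikes like this are hard to catch after the fact, here is a minimal sketch of how the database size could be sampled periodically and logged for later comparison; the database name and connection parameters are placeholders.

    # Sketch: periodically record the total database size so sudden growth
    # (e.g. ~10GB in 48 hours) can be spotted and correlated with running jobs.
    # Database name and connection parameters are placeholders.
    import time
    import psycopg2

    def sample_db_size():
        conn = psycopg2.connect(host="localhost", dbname="jackrabbit",
                                user="postgres", password="secret")
        with conn, conn.cursor() as cur:
            cur.execute("SELECT pg_database_size(current_database())")
            size_bytes = cur.fetchone()[0]
        conn.close()
        return size_bytes

    while True:
        with open("db_size.log", "a") as log:
            log.write(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} {sample_db_size()}\n")
        time.sleep(3600)  # sample once per hour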
Updated by Costantino Perciante about 7 years ago
- Assignee changed from Costantino Perciante to Lucio Lelii