Incident #11147
The proto dataminers are running out of space
Status: Closed (100% done)
Description
There is data under /home/gcube/tomcat/webapps/wps/ecocfg that appears to have been left behind by past computations: the files date from February 6th and 7th, while the most recent process I can see started on February 8th.
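A quick way to confirm which files are leftovers is to list everything under that directory whose modification time predates a chosen cutoff. The following is a minimal Python sketch, assuming that modification time is a reasonable proxy for "left by a past computation". The function name `find_stale_files` and the 24-hour cutoff are illustrative choices, not part of DataMiner.

```python
import os
import stat
import time


def find_stale_files(root, cutoff_epoch):
    """Return (path, size_in_bytes) for regular files under root
    whose last modification is older than cutoff_epoch."""
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
            if stat.S_ISREG(info.st_mode) and info.st_mtime < cutoff_epoch:
                stale.append((path, info.st_size))
    return stale


if __name__ == "__main__":
    root = "/home/gcube/tomcat/webapps/wps/ecocfg"
    # Assumption: anything untouched for more than 24 hours is a leftover.
    cutoff = time.time() - 24 * 3600
    leftovers = find_stale_files(root, cutoff)
    total_mib = sum(size for _path, size in leftovers) / 2**20
    # Largest files first, since those matter most for disk space.
    for path, size in sorted(leftovers, key=lambda item: item[1], reverse=True):
        print(f"{size:>14d}  {path}")
    print(f"{len(leftovers)} stale files, {total_mib:.1f} MiB in total")
```

The script only reports; nothing is deleted, so the list can be checked against the running computations before anything is removed.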
Updated by Andrea Dell'Amico over 7 years ago
- Project changed from 8 to D4Science Infrastructure
- Category changed from Default to Application
- Priority changed from Normal to High
- Target version changed from Unsprintable to UnSprintable
Updated by Gianpaolo Coro over 7 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
The problem was caused by long-running, memory-demanding processes run by @scarponi@isti.cnr.it. It seems that many of them were OOM-killed, so the DataMiner never deleted their input files, which were huge. @roberto.cirillo@isti.cnr.it has now deleted these files to stem the problem, but we should understand how to tackle it properly and whether this can be done from the service. I will open a separate ticket for this. Overall, the current situation arises because Paolo's processes are at the limit of our resources and should perhaps be redesigned.
For the sake of completeness, here are the OOM-kill messages we found in the system logs:
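One possible direction for handling this from the service side: a process that is OOM-killed receives SIGKILL and cannot run any cleanup of its own, so the deletion has to happen in a parent process that survives. The following is a minimal Python sketch, assuming each computation runs as a child process with its own working directory. The function `run_with_cleanup` and that directory layout are hypothetical and do not describe how DataMiner actually works.

```python
import shutil
import subprocess


def run_with_cleanup(command, workdir):
    """Run command as a child process and always delete workdir afterwards.

    Returns the child's return code. On POSIX a negative value -N means
    the child was terminated by signal N (for example -9 for SIGKILL).
    """
    try:
        completed = subprocess.run(command, cwd=workdir)
        return completed.returncode
    finally:
        # Runs whether the child exited normally, failed, or was killed.
        shutil.rmtree(workdir, ignore_errors=True)
```

This does not help if the supervising process itself is the one killed; in that case a sweep of leftover directories at service startup would still be needed.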
Feb 6 09:17:11 dataminer0-proto logger: algorithms-updater: the add command string is ./addAlgorithm.sh ARCHNER_IT NER_ALGORITHMS org.gcube.dataanalysis.executor.rscripts.ArchnerIt /d4science.research-infrastructures.eu/gCubeApps/ RPrototypingLab transducerers Y http://data.d4science.org/c1ZaT0ZOSi8wQlI4bDRQdDFHSmdFZCtMb3YvcDA5STdHbWJQNStIS0N6Yz0 "Archaeological NER web-service for Italian text. This named entity recognition web-service will find Artefact, Colour, Material, Period, Place, Person, Site, Technique and Timespan entities" proto/software
Feb 6 09:17:46 dataminer0-proto kernel: [32448367.258536] R invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
Feb 6 09:17:46 dataminer0-proto kernel: [32448367.258540] R cpuset=/ mems_allowed=0
Feb 6 09:17:46 dataminer0-proto kernel: [32448367.258543] CPU: 0 PID: 23779 Comm: R Tainted: GF 3.13.0-32-generic #57-Ubuntu
Feb 6 09:17:46 dataminer0-proto kernel: [32448367.258545] 0000000000000000 ffff8803aebdfa80 ffffffff8171bcb4 ffff880004a32fe0
Feb 6 09:17:46 dataminer0-proto kernel: [32448367.258548] ffff8803aebdfb08 ffffffff817165ef ffffffff81067886 ffff8803aebdfae0
Feb 6 09:17:46 dataminer0-proto kernel: [32448367.258551] ffffffff810c75dc 0000000000000001 ffff8803e60ebe38 0000000000000000
Feb 6 09:17:46 dataminer0-proto kernel: [32448367.258553] Call Trace:
Feb 6 09:17:46 dataminer0-proto kernel: [32448367.258561] [<ffffffff8171bcb4>] dump_stack+0x45/0x56
Feb 6 09:17:46 dataminer0-proto kernel: [32448367.258564] [<ffffffff817165ef>] dump_header+0x7f/0x1f1
Feb 6 09:17:46 dataminer0-proto kernel: [32448367.258568] [<ffffffff81067886>] ? put_online_cpus+0x56/0x80
Feb 6 09:17:46 dataminer0-proto kernel: [32448367.258570] [<ffffffff810c75dc>] ? rcu_oom_notify+0xcc/0xf0
Feb 6 09:17:46 dataminer0-proto kernel: [32448367.258574] [<ffffffff81151bfe>] oom_kill_process+0x1ce/0x330
Feb 7 07:58:18 dataminer0-proto logger: algorithms-updater: the add command string is ./addAlgorithm.sh ARCHNER_IT NER_ALGORITHMS org.gcube.dataanalysis.executor.rscripts.ArchnerIt /d4science.research-infrastructures.eu/gCubeApps/ RPrototypingLab transducerers Y http://data.d4science.org/c1ZaT0ZOSi8wQlI4bDRQdDFHSmdFZCtMb3YvcDA5STdHbWJQNStIS0N6Yz0 "Archaeological NER web-service for Italian text. This named entity recognition web-service will find Artefact, Colour, Material, Period, Place, Person, Site, Technique and Timespan entities" proto/software
Feb 7 07:59:02 dataminer0-proto CRON[17445]: (gcube) CMD (/usr/local/bin/algorithms-updater > /home/gcube/wps_algorithms_install_log/algorithms_updater_cron.log 2>&1)
Feb 6 00:14:11 dataminer0-proto logger: algorithms-updater: algorithm ARCHNER_IT is already present but a newer version exists
Feb 7 07:59:03 dataminer0-proto logger: algorithms-updater: another job still running, exiting.
Feb 7 07:59:06 dataminer0-proto logger: algorithms-updater: the adding of algorithm ARCHNER_IT succeeded
Feb 7 07:59:06 dataminer0-proto logger: algorithms-updater: Exiting
Feb 7 07:59:14 dataminer0-proto kernel: [32530054.819562] R invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
Feb 7 07:59:14 dataminer0-proto kernel: [32530054.819566] R cpuset=/ mems_allowed=0
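To pull only the OOM-killer events out of a log like this, the `invoked oom-killer` kernel lines can be matched with a regular expression. The following is a minimal Python sketch, assuming the classic syslog layout seen in these messages with one entry per line. The function name `find_oom_events` is illustrative, and the location of the log file on the host is not stated in this ticket.

```python
import re

# Matches kernel lines such as:
#   Feb 6 09:17:46 host kernel: [32448367.258536] R invoked oom-killer: ...
OOM_LINE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) kernel: \[\s*(?P<uptime>[\d.]+)\] "
    r"(?P<process>.+?) invoked oom-killer:"
)


def find_oom_events(lines):
    """Return (timestamp, host, process) for every line reporting
    that a process invoked the OOM killer."""
    events = []
    for line in lines:
        match = OOM_LINE.match(line)
        if match:
            events.append((match["timestamp"], match["host"], match["process"]))
    return events
```

Any iterable of lines works as input, for example an open log file. Note that the process named on the `invoked oom-killer` line is the one whose allocation triggered the killer, which is not necessarily the process that was killed.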
Updated by Andrea Dell'Amico over 7 years ago
Also keep in mind that the proto dataminers have half the RAM of the other production ones.