Task #12861

closed

Task #12858: Purge old records from public GRSF VRE

Purge old records of the public GRSF Catalogue

Added by Francesco Mangiacrapa over 6 years ago. Updated over 6 years ago.

Status:
Closed
Priority:
Urgent
Target version:
Start date:
Nov 12, 2018
Due date:
% Done:

100%

Estimated time:

Description

We need to remove all records of the public GRSF Catalogue (it's https://ckan-grsf.d4science.org/)


Related issues

Related to D4Science Infrastructure - Incident #12944: Error Connecting on rstudio2.d4science.org | Closed | Roberto Cirillo | Nov 27, 2018 | Nov 27, 2018
Related to StocksAndFisheriesKB - Bug #12994: Group are not created in GRSF VRE | Rejected | Luca Frosini | Dec 06, 2018
Related to D4Science Infrastructure - Task #13087: Please upgrade grsf-publisher-ws to latest version | Closed | Andrea Dell'Amico | Dec 27, 2018

Actions #1

Updated by Francesco Mangiacrapa over 6 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Francesco Mangiacrapa over 6 years ago

  • Status changed from In Progress to Feedback
  • % Done changed from 0 to 100

All records have been removed from https://ckan-grsf.d4science.org/
@marketak@ics.forth.gr and @minadakn@ics.forth.gr let me know if I have to remove the groups too (you can see them at: https://ckan-grsf.d4science.org/group)

Actions #3

Updated by Yannis Marketakis over 6 years ago

If we remove them, they will be re-created when we publish data. Is that correct?

Actions #4

Updated by Francesco Mangiacrapa over 6 years ago

Yannis Marketakis wrote:

If we remove them, they will be re-created when we publish data. Is that correct?

Yes, that is correct. Barring bugs, when the records are re-published the groups will be created on the fly (and the records added to them) for the fields that have the property isGroup = Yes (see https://wiki.gcube-system.org/gcube/GCube_Data_Catalogue_for_GRSF). So may I proceed with removing the groups?

Actions #5

Updated by Yannis Marketakis over 6 years ago

Yes please do.
Thanks

Actions #6

Updated by Francesco Mangiacrapa over 6 years ago

  • Status changed from Feedback to Closed
Actions #7

Updated by Aureliano Gentile over 6 years ago

  • Status changed from Closed to In Progress

New records have been published in the GRSF VRE, the entry page is not showing "Browse by Organisations" and "Browse by Groups". Any reasons why?

Actions #8

Updated by Aureliano Gentile over 6 years ago

As per interaction with @francesco.mangiacrapa@isti.cnr.it I would need assistance from:

@marketak@ics.forth.gr for confirming that such approved GRSF records are associated to groups
@luca.frosini@isti.cnr.it to check why groups are missing.

Actions #9

Updated by Yannis Marketakis over 6 years ago

@aureliano.gentile@fao.org grouping is a facility offered by the catalogue. Such information is not stored in the GRSF KB. The published records are added in groups during publishing.

Actions #10

Updated by Aureliano Gentile over 6 years ago

thanks, @francesco.mangiacrapa@isti.cnr.it you have your answer from FORTH.

Actions #11

Updated by Francesco Mangiacrapa over 6 years ago

Yannis Marketakis wrote:

@aureliano.gentile@fao.org grouping is a facility offered by the catalogue. Such information is not stored in the GRSF KB. The published records are added in groups during publishing.

Sure and thanks @marketak@ics.forth.gr
The question is: have you already published some records that had to be added to groups during the publishing? If yes, could you attach (to this ticket) a GRSF record (as JSON source of input) already published? It will be used by @luca.frosini@isti.cnr.it to check why the groups were not created...

Actions #12

Updated by Yannis Marketakis over 6 years ago

@francesco.mangiacrapa@isti.cnr.it the records published in GRSF VRE are replicas of GRSF records found in GRSF Admin VRE. As such all of them fall under certain groups.

You will find the JSON contents of all these GRSF records (597 in total) at https://goo.gl/wQDAV2

Actions #13

Updated by Francesco Mangiacrapa over 6 years ago

Yannis Marketakis wrote:

@francesco.mangiacrapa@isti.cnr.it the records published in GRSF VRE are replicas of GRSF records found in GRSF Admin VRE. As such all of them fall under certain groups.

You will find the JSON contents of all these GRSF records (597 in total) at https://goo.gl/wQDAV2

Thanks a lot, @marketak@ics.forth.gr.
@luca.frosini@isti.cnr.it could you check this issue (why the groups were not created during the publishing) asap?

Actions #14

Updated by Luca Frosini over 6 years ago

  • Related to Incident #12944: Error Connecting on rstudio2.d4science.org added
Actions #15

Updated by Luca Frosini over 6 years ago

  • Status changed from In Progress to Closed

I created the ticket #12994 for the Group issue

Actions #16

Updated by Luca Frosini over 6 years ago

  • Related to Bug #12994: Group are not created in GRSF VRE added
Actions #17

Updated by Francesco Mangiacrapa over 6 years ago

  • Status changed from Closed to In Progress
  • % Done changed from 100 to 90

@marketak@ics.forth.gr, @aureliano.gentile@fao.org

I just copied via script to the GRSF Catalogue (see at https://ckan-grsf.d4science.org/group) the GRSF groups already existing on GRSF-ADMIN Catalogue (https://ckan-grsf-admin2.d4science.org/group).

Now, we need to:

  1. purge all GRSF records published on https://ckan-grsf.d4science.org/dataset;
  2. republish approved GRSF records again to the GRSF VRE. With republishing the groups association should work fine.

Let me know if 2. is feasible, so I'll go with 1.
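
A copy of this kind can be sketched in Python. This is only an illustration, not the script that was actually run: the function name groups_to_create, the choice of payload fields, and the sample group name grsf-stock are assumptions (grsf-fishery is the group name visible in the catalogue's group URL). Each returned payload is meant to be sent to the CKAN group_create action on the target catalogue.

```python
def groups_to_create(source_groups, target_names):
    """Return group_create payloads for source groups missing on the target.

    source_groups: list of group dicts read from the source catalogue.
    target_names:  names of the groups that already exist on the target.
    """
    existing = set(target_names)
    payloads = []
    for group in source_groups:
        if group["name"] in existing:
            continue  # already present on the target, nothing to copy
        payloads.append({
            "name": group["name"],
            "title": group.get("title", group["name"]),
            "description": group.get("description", ""),
        })
    return payloads


# Hypothetical sample data standing in for the two catalogues.
admin_groups = [
    {"name": "grsf-stock", "title": "GRSF Stock"},
    {"name": "grsf-fishery", "title": "GRSF Fishery"},
]
public_group_names = ["grsf-fishery"]

payloads = groups_to_create(admin_groups, public_group_names)
```

Computing the difference first keeps the copy safe to re-run, since groups that already exist on the target are skipped.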

Actions #18

Updated by Yannis Marketakis over 6 years ago

Since there's no other way, we will republish them.

Actions #19

Updated by Pasquale Pagano over 6 years ago

  • Priority changed from High to Urgent

This task is critical since Aureliano already sent the invitation to the GRSF colleagues for reviewing the content of GRSF.

Please take any reasonable action to fix this issue today if possible.

Thanks

Actions #20

Updated by Luca Frosini over 6 years ago

We cleaned the catalogue.
@marketak@ics.forth.gr can you republish the records? Thanks a lot

Actions #21

Updated by Yannis Marketakis over 6 years ago

Thanks Luca.
I am re-publishing right away

Actions #22

Updated by Aureliano Gentile over 6 years ago

Thanks. No one was imagining the need to purge the GRSF VRE again, or at least nobody warned me. Anyway, thanks a lot and sorry for that.
At the moment the groups of source records are empty (e.g. Groups Stock - FIRMS or Groups Stock - FIRMS RAM, etc.). But maybe that is because the publishing is still in progress?

Actions #23

Updated by Yannis Marketakis over 6 years ago

The publishing of the records was completed a couple of hours after updating the ticket.
However, I still see that some groups are empty. @francesco.mangiacrapa@isti.cnr.it and @luca.frosini@isti.cnr.it can you please check?

Actions #24

Updated by Francesco Mangiacrapa over 6 years ago

Yannis Marketakis wrote:

The publishing of the records was completed a couple of hours after updating the ticket.
However, I still see that some groups are empty.

Hi @marketak@ics.forth.gr, @luca.frosini@isti.cnr.it,

checking the published records (at https://ckan-grsf.d4science.org/dataset) that were added to groups (at https://ckan-grsf.d4science.org/group), the situation seems to me the following:

  1. the groups for legacy records ('Stock - FIRMS', 'Stock FIRMS FishSource' and so on) are empty because no legacy records have been published... and they will not be published. Is it right? @aureliano.gentile@fao.org, do we want to remove such groups from GRSF Catalogue?

  2. no fishery records have been published, then the "GRSF Fishery" group is empty (see at https://ckan-grsf.d4science.org/group/grsf-fishery)

  3. This needs investigation (see attached screenshot): 595 published records are of type "Assessment Unit" and all of them should have been added to:
    ** "GRSF Assessment Unit", which contains 593, so 2 records are missing;
    ** "GRSF Stock", which contains 592, so 3 records are missing;

If this analysis sounds right to you, I can start immediately with Luca on finding the published records that are missing from those groups...
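
The missing records in point 3 can be found with a set difference between the published records of a type and the members of a group. A minimal sketch in Python, assuming both lists are available as record identifiers; the function name missing_from_group and the rec-N identifiers are made up for illustration, and only the counts (595, 593, 592) come from the analysis above.

```python
def missing_from_group(record_ids, group_member_ids):
    """Return the ids that are published but absent from the group."""
    return sorted(set(record_ids) - set(group_member_ids))


# Hypothetical identifiers; only the counts mirror the ticket.
assessment_units = [f"rec-{i}" for i in range(595)]
in_assessment_unit_group = assessment_units[:593]
in_stock_group = assessment_units[:592]

missing_assessment_unit = missing_from_group(assessment_units, in_assessment_unit_group)
missing_stock = missing_from_group(assessment_units, in_stock_group)
```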

Actions #26

Updated by Aureliano Gentile over 6 years ago

1: indeed, if no legacy records are published it makes sense to get rid of that group (but it should be kept in the GRSF Admin VRE)

2: this is correct, since so far we approved only stocks records. I guess no action is needed for the time being

3: the figures are not consistent. If by type we have 597 records, then by group we should have 597 GRSF Stock, 595 GRSF assessment units and 2 GRSF marine resource. Please see also the attached screenshot. (Vice versa, if the groups are correct, then the types are wrong.)
This is an example of what I mean when I say the application is not very stable or reliable: in every release we need manual checks/fixes, and we need to understand the underlying reasons...

Actions #27

Updated by Francesco Mangiacrapa over 6 years ago

@aureliano.gentile@fao.org about the point 1.
I'm going to remove the legacy groups from GRSF Catalogue (they will be kept in the GRSF Admin VRE).
The list of legacy groups is reported (in red) in the attached image. Could you confirm the list?

Actions #28

Updated by Aureliano Gentile over 6 years ago

Thanks. Sorry, but I think there is a misunderstanding: that list is for GRSF records, and those groups should be there and populated with the current numbers of approved records. None of those items should be removed. If you want, we can have a brief call on that tomorrow.

Actions #29

Updated by Luca Frosini over 6 years ago

Hi all,

The field refers_to is used to automatically create the field Database Source and to add the GRSF record to some groups.

As the wiki reports (see https://wiki.gcube-system.org/gcube/GCube_Data_Catalogue_for_GRSF#Common_Metadata), the field refers_to contains:

"A list of objects of the format {"url": "http://", "id": "..."} that allows the aggregated GRSF records to point to their source records already published in the catalogue. The url and the id are both mandatory and are the ones returned by the services when a source record is published."

The code retrieves the referred records and uses their information to create the field Database Source and to add the record to the appropriate groups.
Unfortunately, the referred records (which are legacy records) are not present in GRSF, hence the Database Source is not present and the groups are not added.
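
The behaviour described above can be illustrated with a small Python sketch. It is not the publisher's code: the function name resolve_database_sources, the in-memory dictionaries standing in for the catalogues, the legacy-1 identifier and the database_source key are all assumptions. It only shows why a refers_to entry that points at a record missing from the catalogue yields no Database Source, and therefore no groups.

```python
def resolve_database_sources(record, catalogue):
    """Collect the database sources of the records listed in refers_to.

    catalogue maps a record id to the published record (hypothetical
    stand-in for a lookup in the catalogue).
    """
    sources = []
    for ref in record.get("refers_to", []):
        referred = catalogue.get(ref["id"])
        if referred is None:
            continue  # the referred legacy record is not published here
        source = referred.get("database_source")
        if source and source not in sources:
            sources.append(source)
    return sources


grsf_record = {"refers_to": [{"url": "http://", "id": "legacy-1"}]}

# GRSF Admin holds the legacy record; the public GRSF catalogue does not.
admin_catalogue = {"legacy-1": {"database_source": "FIRMS"}}
public_catalogue = {}

sources_in_admin = resolve_database_sources(grsf_record, admin_catalogue)
sources_in_public = resolve_database_sources(grsf_record, public_catalogue)
```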

Actions #30

Updated by Pasquale Pagano over 6 years ago

  • Assignee changed from Francesco Mangiacrapa to Yannis Marketakis

This issue has to be analyzed by FORTH. As reported, GRSF is missing information that is key for its users. This information is present in GRSF Admin but not reported in GRSF. We need to find a solution to this issue shortly:

  • an additional field extracted by the knowledge base and specified at submission time may do the job;
  • a link to the GRSF Admin record could also work but in this case, the user will find a link from GRSF to GRSF Admin and s/he will not have the rights to access it.

Please let us know.

Actions #31

Updated by Yannis Marketakis over 6 years ago

  • Assignee changed from Yannis Marketakis to Aureliano Gentile

I do not see any technical difficulties here. It is clearly a matter of decision.

I would expect that colleagues from FAO come up with a decision about this and we (the technical team) proceed with this.

The alternatives I see are:

Personally speaking, I think that options 3 and 4 are not so elegant. I would like to mention again that there are no technical issues in implementing any of the above. It's clearly a decision to be made.

Actions #32

Updated by Pasquale Pagano over 6 years ago

Waiting for @aureliano.gentile@fao.org, please see below my personal opinion.

The alternatives I see are:

  • Add only the source of the legacy record that contributed for this GRSF record (i.e. FIRMS, RAM, FishSource)

I think this may confuse the user accessing GRSF.

Not so elegant either, since for RAM we do not have a persistent URL (the BB domain is related to the project and it will not be maintained for many more years)

  • Add the catalogue URL of the legacy records from the GRSF_Admin VRE

I think this is safe since we maintain both GRSF Admin and GRSF. Those URLs are persistent and we can properly advise the users that their access requires specific privileges.

  • Publish the legacy records in GRSF VRE as well

This is clearly fine if FAO decides to select this option.

Personally speaking, I think that options 3 and 4 are not so elegant.

As you can see above, I think that only 3 and 4 are viable.

Actions #33

Updated by Yannis Marketakis over 6 years ago

Thanks for your answers Lino. See some comments below:

Pasquale Pagano wrote:

Waiting for @aureliano.gentile@fao.org, please see below my personal opinion.

The alternatives I see are:

  • Add only the source of the legacy record that contributed for this GRSF record (i.e. FIRMS, RAM, FishSource)

I think this may confuse the user accessing GRSF.

I do not see why they will be confused.

Not so elegant either, since for RAM we do not have a persistent URL (the BB domain is related to the project and it will not be maintained for many more years)

I agree with this.

  • Add the catalogue URL of the legacy records from the GRSF_Admin VRE

I think this is safe since we maintain both GRSF Admin and GRSF. Those URLs are persistent and we can properly advise the users that their access requires specific privileges.

I agree it is safe; however, the problem here is that users registered in GRSF VRE would also have to register in GRSF-ADMIN VRE to check them out.

  • Publish the legacy records in GRSF VRE as well

This is clearly fine if FAO decides to select this option.

Personally speaking, I think that options 3 and 4 are not so elegant.

As you can see above, I think that only 3 and 4 are viable.

Actions #34

Updated by Aureliano Gentile over 6 years ago

Thanks to all, appreciated. I discussed the matter with Anton and also showed him the citation aspect. We think it would be enough to have, under the box "Data and Resources", simply the list of the data source(s) as appropriate. In the public VRE the link to the legacy record was considered confusing and indeed we were asked to omit it. If you consider the citation https://support.d4science.org/issues/12278 , following the GRSF record citation we are envisaging something like "Database sources: [FIRMS]", which is then followed by the original source record citation.

In conclusion, if feasible, at this stage it would be enough to have the sources listed and the "Groups" populated with that information. Admin users have the opportunity to browse legacy records and make all the checks, while for the public this could be enough. The citation, when completed, will give access to the source URL on the data owner websites.

Opening this pilot release to other users and the discussion at FSC11 will give further directions, if needed.

Does it make sense to you? Thanks.

Actions #35

Updated by Yannis Marketakis over 6 years ago

I am OK with that.

Technically speaking this means that we should include the database sources in the JSON serialization (as we do with legacy records). For example:

"database_sources" : [ {
    "name" : "FIRMS",
    "description" : "Fisheries and Resources Monitoring System aims to ...",
    "url" : "http://firms.fao.org/firms/en"
  }, 
  {
    "name" : "FishSource",
    "description" : "FishSource is an online information resource about ...",
    "url" : "http://www.fishsource.com"
  },
  {
    "name" : "RAM",
    "description" : "RAM Legacy Stock Assessment Database is ...",
    "url" : "http://ramlegacy.org"
  }
],

CNR colleagues, is this OK with you?
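
As a rough illustration of how a consumer could turn this field into group names, here is a Python sketch. It is not the publisher's implementation: the function name source_group_names and the lower-casing rule for group names are assumptions, and the sample record keeps only the name and url of each source.

```python
import json

# Sample record containing only the proposed field (descriptions omitted).
record_json = """
{
  "database_sources": [
    {"name": "FIRMS", "url": "http://firms.fao.org/firms/en"},
    {"name": "FishSource", "url": "http://www.fishsource.com"},
    {"name": "RAM", "url": "http://ramlegacy.org"}
  ]
}
"""


def source_group_names(record):
    """Derive one group name per distinct database source.

    Lower-casing the source name is an assumed naming rule.
    """
    names = {source["name"].lower() for source in record.get("database_sources", [])}
    return sorted(names)


record = json.loads(record_json)
group_names = source_group_names(record)
```

Collecting the names in a set means a source listed more than once still maps to a single group.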

Actions #36

Updated by Luca Frosini over 6 years ago

The solution is ok for me.
Please take into account that this behaviour will also occur on the GRSF_Admin VRE for any record containing the field database_sources.

If this is ok for everyone, I'll modify the code.

Actions #37

Updated by Aureliano Gentile over 6 years ago

I understand this is additional information added to the JSON of the GRSF record, so at worst it won't be used in specific contexts. So it is fine with me. FYI, this afternoon we'll have the first call with RAM colleagues to start validating GRSF VRE and approving new records in GRSF VRE Admin. Let me know if these modifications imply erase/republish or other drastic actions in the GRSF KB.

Actions #38

Updated by Yannis Marketakis over 6 years ago

@luca.frosini@isti.cnr.it we already have this feature when publishing records in the GRSF_Admin (in particular when publishing legacy records)

@aureliano.gentile@fao.org
I think we can simply update the existing records in GRSF VRE, so nothing will be removed.

Actions #39

Updated by Luca Frosini over 6 years ago

Yannis Marketakis wrote:

@luca.frosini@isti.cnr.it we already have this feature when publishing records in the GRSF_Admin (in particular when publishing legacy records)

Hi @marketak@ics.forth.gr,

sorry, I missed your comment.

Looking at the code, it seems that the database_sources field is only used to create additional resources.

Legacy records are added to organizations, not to groups.

Actions #40

Updated by Yannis Marketakis over 6 years ago

So in that case, the field name should be different. Right?

Actions #41

Updated by Luca Frosini over 6 years ago

Yannis Marketakis wrote:

So in that case, the field name should be different. Right?

It could. But if it is easier for you to use the database_sources field, I can use it.
In GRSF_Admin it should not cause any issues, because the record is just added twice to the same group.

Just let me know what you prefer.

Actions #42

Updated by Yannis Marketakis over 6 years ago

I thought it would create issues if the field name is the same.
Since it does not, then use database_sources. It is fine.

Actions #43

Updated by Luca Frosini over 6 years ago

  • Related to Task #13087: Please upgrade grsf-publisher-ws to latest version added
Actions #44

Updated by Luca Frosini over 6 years ago

The new feature is available in the production instance.
@marketak@ics.forth.gr you can update the records when you want.
Please be sure that the update rate is limited as was agreed with Costantino.

Actions #45

Updated by Yannis Marketakis over 6 years ago

Hi Luca. Thanks a lot.
What exactly do you mean with the following?
Luca Frosini wrote:

Please be sure that the update rate is limited as was agreed with Costantino.

Actions #46

Updated by Luca Frosini over 6 years ago

Hi Yannis,

the publishing rate should be limited to avoid failures in async operations caused by the workload.
Costantino told me you agreed on a delay between invocations.
If in doubt, given that there are only a few records, you could try a delay of 60 seconds.

Is that feasible?
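
A delay between invocations can be sketched as follows in Python. This is only an illustration under stated assumptions: update_with_delay and its parameters are hypothetical names, the update callable stands in for whatever call updates a single record, and only the 60-second figure comes from the comment above. The sleep function is injectable so the example runs without actually waiting.

```python
import time


def update_with_delay(records, update, delay_seconds=60, sleep=time.sleep):
    """Call update(record) for each record, pausing between invocations."""
    results = []
    for index, record in enumerate(records):
        if index > 0:
            sleep(delay_seconds)  # idle period before every call but the first
        results.append(update(record))
    return results


# Demonstration with a stand-in update and a recorded (not real) sleep.
pauses = []
done = update_with_delay(
    ["a", "b", "c"],
    update=lambda record: record.upper(),
    sleep=pauses.append,
)
```

With the default sleep, the same call would wait 60 seconds between consecutive updates.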

Actions #47

Updated by Yannis Marketakis over 6 years ago

As far as I remember we did not have any issues with updates.
However, it is fine by me to add an idle period between updates.

Thanks

Actions #48

Updated by Yannis Marketakis over 6 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 90 to 100

All the records (597 in number) in GRSF have been updated.

Actions #49

Updated by Aureliano Gentile over 6 years ago

I checked the GRSF VRE, groups are now available for source databases (ram, firms, fishsource) and also record pages are enriched with the box "Data and resources", similarly to GRSF Admin vre. Many thanks
