Task #12227

closed

Encoding issue on Dataminer proto 5

Added by Gianpaolo Coro almost 7 years ago. Updated over 6 years ago.

Status: Closed
Priority: Normal
Assignee: _InfraScience Systems Engineer
Category: High-Throughput-Computing
Target version:
Start date: Jul 24, 2018
Due date:
% Done: 100%
Estimated time:
Infrastructure: Production

Description

I have a difficult issue on one of the prototype Dataminers for which I need help:

Several text analysis algorithms read input files in UTF-8 encoding and write JSON files in UTF-8. They run in R.

Only on dataminer5-proto, the UTF-8 file write fails when there is an accented character in the text (e.g. "oggi è una bella giornata"). The reported error is generic:

Warning message:
In writeLines(json, fileConn) : invalid char string in output conversion

writeLines is a native R function and is invoked correctly in the code:

fileConn<-file(outjsonfile,encoding = "UTF-8")
writeLines(json, fileConn)
close(fileConn)
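
One way to make this write independent of the session locale, offered as a sketch and not as the fix applied in this ticket: convert the string to UTF-8 explicitly with enc2utf8() and write the resulting bytes unchanged through a binary-mode connection with useBytes = TRUE. The helper name write_utf8 and the sample string are hypothetical.

```r
# Hypothetical helper: write a character vector as UTF-8 without relying on
# the locale-dependent re-encoding done by file(..., encoding = "UTF-8").
write_utf8 <- function(text, path) {
  con <- file(path, open = "wb")   # binary mode: no encoding translation
  on.exit(close(con), add = TRUE)
  # enc2utf8() converts to UTF-8; useBytes = TRUE writes those bytes as-is
  writeLines(enc2utf8(text), con, sep = "\n", useBytes = TRUE)
  invisible(path)
}

# Sample input (assumption): the accented sentence from this ticket
json <- "{\"text\": \"oggi \u00e8 una bella giornata\"}"
outjsonfile <- tempfile(fileext = ".json")
write_utf8(json, outjsonfile)
```

If the text was read with useBytes = TRUE and is already UTF-8 but not marked as such, declaring it with Encoding(text) <- "UTF-8" before the call is probably needed; this has not been verified against the algorithms discussed here.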

Input files are plain text files read as UTF-8 files using bytes:

inputFile <- file(inputfile, encoding="UTF-8")
filetext<-readChar(inputFile, file.info(inputfile)$size, useBytes = T)

The only package used by the algorithms is "jsonlite".
I have checked the machine and R locales but they seem OK. Perhaps there is some other difference in the locales I cannot see.
From sample tests, this issue occurs only on dataminer5-proto.
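
To compare locales between machines, a small report such as the following could be run on each Dataminer. It is only a sketch: the function name locale_report is hypothetical, and the result is meaningful only if it is run inside the same process environment as the algorithms (for example the R session started by the service), since an interactive shell may inherit different variables.

```r
# Hypothetical helper: collect the locale-related settings of the current
# R session so that the output can be compared across Dataminers.
locale_report <- function() {
  list(
    lc_ctype = Sys.getlocale("LC_CTYPE"),
    lc_all   = Sys.getlocale("LC_ALL"),
    # NA means the variable is not set in this process
    env      = Sys.getenv(c("LANG", "LC_ALL", "LC_CTYPE"), unset = NA),
    # TRUE when R considers the session locale to be UTF-8
    utf8     = isTRUE(l10n_info()[["UTF-8"]])
  )
}

str(locale_report())
```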


Actions #1

Updated by Gianpaolo Coro almost 7 years ago

For the time being, we are going to stop dataminer5-proto in order to make the NLP Hub work.

Actions #2

Updated by Andrea Dell'Amico almost 7 years ago

  • Status changed from New to In Progress

I just compared the relevant parts of dataminer4-proto and dataminer5-proto without finding any difference:

  • environment variables
  • R version
  • version of the jsonlite R package
  • tomcat options
  • smartgears version

are the same on both servers. Is there a way to run a test from command line, so that I can trace the execution?

Actions #3

Updated by Gianpaolo Coro almost 7 years ago

Yes, the issue is weird, but this morning I verified that it occurs only on that machine. The process that highlighted the issue sends an XML file via POST to the DM, which contains the UTF-8 text. Indeed, I had thought there was something in the tomcat locale that saved the file in a non-UTF-8 format on the DM. The POST request is done by another DM algorithm, and reproducing it as a standalone call could take a while.

A direct test using a file on the Workspace can be done using this link:

http://dataminer5-proto.d4science.org/wps/WebProcessingService?request=Execute&service=WPS&Version=1.0.0&gcube-token=<token>&lang=en-US&Identifier=org.gcube.dataanalysis.wps.statisticalmanager.synchserver.mappedclasses.transducerers.TAGME_ITALIAN_NER&DataInputs=inputfile=https%3A%2F%2Fdata.d4science.org%2FRkc1VUJ2ZDdMUGowTnkramdGcUpMcDBFV2JlODh4SEpHbWJQNStIS0N6Yz0;

but I'm not sure the issue will manifest. Perhaps it would be easier to reinstall the machine from scratch; otherwise I will need some time to set up a proper test.

Actions #4

Updated by Andrea Dell'Amico almost 7 years ago

Gianpaolo Coro wrote:

Yes, the issue is weird, but this morning I verified that it occurs only on that machine. The process that highlighted the issue sends an XML file via POST to the DM, which contains the UTF-8 text. Indeed, I had thought there was something in the tomcat locale that saved the file in a non-UTF-8 format on the DM. The POST request is done by another DM algorithm, and reproducing it as a standalone call could take a while.

A direct test using a file on the Workspace can be done using this link:

http://dataminer5-proto.d4science.org/wps/WebProcessingService?request=Execute&service=WPS&Version=1.0.0&gcube-token=<token>&lang=en-US&Identifier=org.gcube.dataanalysis.wps.statisticalmanager.synchserver.mappedclasses.transducerers.TAGME_ITALIAN_NER&DataInputs=inputfile=https%3A%2F%2Fdata.d4science.org%2FRkc1VUJ2ZDdMUGowTnkramdGcUpMcDBFV2JlODh4SEpHbWJQNStIS0N6Yz0;

but I'm not sure the issue will manifest.

Isn't it reliably reproducible? I'm going to remove the host from the load balancer and start the tomcat instance again, so that I can run some tests.

Perhaps it would be easier to reinstall the machine from scratch; otherwise I will need some time to set up a proper test.

Well, if we don't know why there's such a problem, reinstalling is not a guarantee. The VM was installed at the same time as dataminer4-proto and apparently there is no difference between the two.

Actions #5

Updated by Andrea Dell'Amico almost 7 years ago

The test you posted does not fail. The XML response is:

<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ows="http://www.opengis.net/ows/1.1" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 http://schemas.opengis.net/wps/1.0.0/wpsExecute_response.xsd" serviceInstance="http://dataminer5-proto.d4science.org:80//wps/WebProcessingService" xml:lang="en-US" service="WPS" version="1.0.0">
<wps:Process wps:processVersion="1.1.0">
<ows:Identifier>
org.gcube.dataanalysis.wps.statisticalmanager.synchserver.mappedclasses.transducerers.TAGME_ITALIAN_NER
</ows:Identifier>
<ows:Title>TAGME_ITALIAN_NER</ows:Title>
</wps:Process>
<wps:Status creationTime="2018-07-26T14:46:53.740+02:00">
<wps:ProcessSucceeded>Process successful</wps:ProcessSucceeded>
</wps:Status>
<wps:ProcessOutputs>
<wps:Output>
<ows:Identifier>non_deterministic_output</ows:Identifier>
<ows:Title>NonDeterministicOutput</ows:Title>
<wps:Data>
<wps:ComplexData schema="http://schemas.opengis.net/gml/2.1.2/feature.xsd" mimeType="text/xml; subtype=gml/2.1.2">
<ogr:FeatureCollection xmlns:ogr="http://ogr.maptools.org/" xmlns:gml="http://www.opengis.net/gml" xmlns:d4science="http://www.d4science.org" xsi:schemaLocation="http://ogr.maptools.org/ result_8751.xsd">
<gml:featureMember>
<ogr:Result fid="F0">
<d4science:Data>
http://data.d4science.org/Q2JqWHY2WlU1RWQ3eGEreHo2MmtPb3lnVUtsYXpLTUxHbWJQNStIS0N6Yz0-VLT
</d4science:Data>
<d4science:Description>Log of the computation</d4science:Description>
<d4science:MimeType>text/csv</d4science:MimeType>
</ogr:Result>
<ogr:Result fid="F1">
<d4science:Data>
http://data.d4science.org/Q2JqWHY2WlU1RWQ3eGEreHo2MmtPa2ZZRXM4bmIvVEhHbWJQNStIS0N6Yz0-VLT
</d4science:Data>
<d4science:Description>outjsonfile</d4science:Description>
<d4science:MimeType>application/d4science</d4science:MimeType>
</ogr:Result>
</gml:featureMember>
</ogr:FeatureCollection>
</wps:ComplexData>
</wps:Data>
</wps:Output>
</wps:ProcessOutputs>
</wps:ExecuteResponse>

The result output file is attached.

Actions #6

Updated by Andrea Dell'Amico almost 7 years ago

I ran it more than once and it never failed.

Actions #7

Updated by Andrea Dell'Amico almost 7 years ago

@gianpaolo.coro@isti.cnr.it can you run some different tests? I cannot explain the behaviour: the errors started on July 20th and lasted until tomcat was shut down on July 24th. Did you try a tomcat restart in between?

Actions #8

Updated by Gianpaolo Coro almost 7 years ago

Hi, tomcat had been restarted. The fact that the test works reinforces my guess that there is something at the tomcat level, i.e. it occurs when a UTF-8 text is sent directly to the service via POST without passing through the Workspace.

I will be on vacation from next week, so I don't have time to assemble a more detailed test. Either we wait until after 21 August or you could reinstall the machine.

Actions #9

Updated by Andrea Dell'Amico almost 7 years ago

Gianpaolo Coro wrote:

Hi, tomcat had been restarted. The fact that the test works reinforces my guess that there is something at the tomcat level, i.e. it occurs when a UTF-8 text is sent directly to the service via POST without passing through the Workspace.

I will be on vacation from next week, so I don't have time to assemble a more detailed test. Either we wait until after 21 August or you could reinstall the machine.

I want to understand what's happening. If there's something broken at the tomcat level, the only possibility is a wrong manual intervention from someone.
Because, again, the VM has to be identical to dataminer4-proto (and to all the other tomcat installations, FYI).

Actions #10

Updated by Andrea Dell'Amico almost 7 years ago

  • Tracker changed from Incident to Task

Actions #11

Updated by Andrea Dell'Amico over 6 years ago

Can we restart this activity?

Actions #12

Updated by Gianpaolo Coro over 6 years ago

I'm going to build a test for Dataminer 5.

Actions #13

Updated by Gianpaolo Coro over 6 years ago

I cannot reproduce the issue systematically because it seems to be random. Is it possible to re-install the machine?

Actions #14

Updated by Gianpaolo Coro over 6 years ago

Is it possible to check that the following information is aligned on all dataminers?

  • environment variables
  • R version
  • R locale
  • tomcat locale
  • machine locale
  • version of the jsonlite R package
  • tomcat options
  • smartgears version

Actions #15

Updated by Gianpaolo Coro over 6 years ago

After @andrea.dellamico@isti.cnr.it has added this forcing in the Rprofile.site file:

readRenviron("/etc/default/locale")
LANG <- Sys.getenv("LANG")
if(nchar(LANG))
   Sys.setlocale("LC_ALL", LANG)

the problem does not occur anymore. Why did this operation have to be forced?

Actions #16

Updated by Andrea Dell'Amico over 6 years ago

  • % Done changed from 20 to 80

Gianpaolo Coro wrote:

the problem does not occur anymore. Why did this operation have to be forced?

You reported a locale problem, so I looked for a way to explicitly set the R locale. The above commands set the R locale to be the same as the system one, which is en_US.UTF-8 on all our systems (we explicitly set that one too).

I'm going to add the commands to our Rprofile.site template.
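
For reference, a slightly more defensive variant of the same idea could look like the sketch below. It is not the content of the provisioned template: the function name force_system_locale and its TRUE/FALSE return value are hypothetical, and the default path is the one used in the commands quoted above.

```r
# Hypothetical, more defensive variant of the Rprofile.site commands.
# Returns TRUE only when the locale named in the file was actually applied.
force_system_locale <- function(locale_file = "/etc/default/locale") {
  if (!file.exists(locale_file)) return(FALSE)
  readRenviron(locale_file)            # loads LANG=... into the environment
  lang <- Sys.getenv("LANG")
  if (!nzchar(lang)) return(FALSE)
  # Sys.setlocale() returns "" (with a warning) when the locale cannot be set
  result <- suppressWarnings(Sys.setlocale("LC_ALL", lang))
  nzchar(result)
}
```

Called as force_system_locale() from Rprofile.site, it would leave the session untouched when the file or the LANG entry is missing, which makes a misconfigured host easier to spot.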

Actions #17

Updated by Andrea Dell'Amico over 6 years ago

  • Status changed from In Progress to Feedback
  • % Done changed from 80 to 100

Done. As the problem was impacting dataminer5-proto.d4science.org only, the new version of Rprofile.site will be provisioned during the next infrastructure upgrade.

Actions #18

Updated by Andrea Dell'Amico over 6 years ago

  • Status changed from Feedback to Closed