Issue with tilde characters (ñ or Ñ) when inside a CDATA section during Snapshot Load

EntityDataLoaderImpl.groovy fails when ñ or Ñ appears inside a CDATA section of a snapshot file. For example, this file fails to load:

<?xml version="1.0" encoding="UTF-8"?>
<entity-facade-xml type="bonk2">
    <moqui.screen.form.FormResponseAnswer formId="UsaIrsW2_2020_L4UPA" lastUpdatedStamp="1673064414627" fieldName="w2_f" formResponseId="102185" formResponseAnswerId="129972">
        <valueText><![CDATA[C. AGUSTIN MELGAR ## 109
COL NIÑOS HEROES
VICTORIA, TM 87089 MEX]]></valueText>
    </moqui.screen.form.FormResponseAnswer>
</entity-facade-xml>

Whereas this imports fine in the snapshot:

<mantle.party.contact.PostalAddress city="VICTORIA" postalCode="87089" unitNumber="# 109" contactMechId="110101" toName="FERMIN RAMIREZ-MEDINA" countryGeoId="MEX" lastUpdatedStamp="1661210754521" address2="COL NI&#209;OS HEROES" address1="C. AGUSTIN MELGAR" stateProvinceGeoId="MEX_TM"/>

The failing file produces this error:

Transaction rollback. The rollback was originally caused by: Error running transition in [http://localhost:8080/apps/tools/Entity/DataImport/load]
org.moqui.BaseException: Error loading entity data from file:/Users/sbessire/moqui/runtime/component/gebbers/data/test.xml
	at org.moqui.impl.entity.EntityDataLoaderImpl.loadSingleFile(EntityDataLoaderImpl.groovy:374) ~[moqui-framework-2.1.2-rc2.jar:2.1.2-rc2]
	at org.moqui.impl.entity.EntityDataLoaderImpl$_internalRun_closure1.doCall(EntityDataLoaderImpl.groovy:291) ~[moqui-framework-2.1.2-rc2.jar:2.1.2-rc2]
	at org.moqui.impl.context.TransactionFacadeImpl$_runRequireNew_closure1.doCall(TransactionFacadeImpl.groovy:196) ~[moqui-framework-2.1.2-rc2.jar:2.1.2-rc2]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_271]
Caused by: org.xml.sax.SAXParseException: Invalid byte 2 of 2-byte UTF-8 sequence.

I am not sure whether I can attach a file here. If you paste the XML above into the XML text section of the DataImport screen (DataImport.xml), it loads fine, but the file as it came out of the snapshot does not.

Where did the file that is not working come from? This is a bit puzzling, but one possibility that comes to mind is that the file is not really UTF-8 encoded even though it says it is in the header. A text editor might open it fine because it detects the file encoding and ignores the header, but Moqui doesn’t do that… I think it assumes UTF-8 right now.
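
One way to test the "not really UTF-8" theory is to decode the raw bytes strictly and see where that breaks. This is a minimal Python sketch, not anything from Moqui; the helper name `first_invalid_utf8` and the sample bytes are made up for illustration, and the sample assumes the Ñ was stored as a single 0xD1 byte.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte value) of the first byte that breaks strict
    UTF-8 decoding, or None if the data is valid UTF-8."""
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as err:
        return err.start, data[err.start]


# Hypothetical sample: the failing address line with Ñ stored as one 0xD1 byte.
sample = b"COL NI\xd1OS HEROES"

result = first_invalid_utf8(sample)
print(result)  # (6, 209): offset 6 holds 0xD1
```

To run it against the exported file, read the file in binary mode and pass the bytes in, e.g. `first_invalid_utf8(open(path, "rb").read())`, where `path` is wherever the snapshot file lives.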

The file is moqui.screen.form.FormResponseAnswer.xml, which came from the DataSnapshot's Export Snapshot dialog (with File Per Entity set to true).

Loading FormResponseAnswer was where it failed on the data load of the snapshot (this was the first time creating a W2 for this employee). I took that file and loaded it individually while debugging and eventually just pared it down to the one record giving me grief.

The form responses were generated while creating W2s. This is the definition of the form field element (it is a DBFFT_text-area):

        <moqui.screen.form.DbFormField fieldName="w2_f" title="f Employee's address and ZIP code"
                                       layoutSequenceNum="8" fieldTypeEnumId="DBFFT_text-area" printTop="2.417in" printLeft="0.275in" printWidth="" printHeight="">
            <moqui.screen.form.DbFormFieldAttribute attributeName="cols" value="60"/>
            <moqui.screen.form.DbFormFieldAttribute attributeName="rows" value="4"/>
            <moqui.screen.form.DbFormFieldAttribute attributeName="read-only" value="true"/>
        </moqui.screen.form.DbFormField>

The data in this field is the employee’s home address, generated as:

            <script>employeeHomeStreetString = """${employeeHomeContactInfo?.postalAddress?.address1 ? employeeHomeContactInfo.postalAddress.address1 + (employeeHomeContactInfo.postalAddress.unitNumber ? ' #' + employeeHomeContactInfo.postalAddress.unitNumber : '') : ''}${employeeHomeContactInfo?.postalAddress?.address2 ? '\n' + employeeHomeContactInfo.postalAddress.address2 : ''}"""</script>
            <script>employeeHomeCszString = """${employeeHomeContactInfo?.postalAddress ? (employeeHomeContactInfo.postalAddress.city ?: '') + (employeeHomeContactInfo.postalAddressStateGeo?.geoCodeAlpha2 ? ', ' + employeeHomeContactInfo.postalAddressStateGeo.geoCodeAlpha2 : '') + ' ' + (employeeHomeContactInfo.postalAddress.postalCode ?: '') + (employeeHomeContactInfo.postalAddress.postalCodeExt ? '-' + employeeHomeContactInfo.postalAddress.postalCodeExt : '') + (employeeHomeContactInfo.postalAddressCountryGeo?.geoCodeAlpha3 ? ' ' + employeeHomeContactInfo.postalAddressCountryGeo.geoCodeAlpha3 : '') : ''}"""</script>
            <script>employeeHomeString = """${employeeHomeStreetString}\n${employeeHomeCszString}"""</script>
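
The three script lines above are dense, so here is a rough Python restatement of the same string building, as a reading aid only. The function name `home_address` and the dict layout are made up for this sketch; the state code "TM" and country code "MEX" are assumptions read off the exported value text, and the sketch assumes a postal address record is present.

```python
def home_address(addr: dict, state_alpha2: str = "", country_alpha3: str = "") -> str:
    """Mirror of the Groovy scripts: street line(s), then city/state/zip/country."""
    street = ""
    if addr.get("address1"):
        street = addr["address1"]
        if addr.get("unitNumber"):
            # The script always prefixes ' #', even if unitNumber already has one.
            street += " #" + addr["unitNumber"]
    if addr.get("address2"):
        street += "\n" + addr["address2"]

    csz = addr.get("city") or ""
    if state_alpha2:
        csz += ", " + state_alpha2
    csz += " " + (addr.get("postalCode") or "")
    if addr.get("postalCodeExt"):
        csz += "-" + addr["postalCodeExt"]
    if country_alpha3:
        csz += " " + country_alpha3

    return street + "\n" + csz


# Values taken from the PostalAddress record shown earlier in the thread.
sample_addr = {
    "address1": "C. AGUSTIN MELGAR",
    "unitNumber": "# 109",
    "address2": "COL NI\u00d1OS HEROES",
    "city": "VICTORIA",
    "postalCode": "87089",
}
print(home_address(sample_addr, "TM", "MEX"))
```

With those inputs the result matches the exported value text line for line, including the doubled "##", which comes from the unit number already containing a "#". So the Ñ in the form response is simply copied from address2.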

Of note, the export put all of the DbForm text areas in CDATA sections, and unlike the exported PostalAddress, where the Ñ was written as the character reference &#209;, in this file it was written as the raw byte 0xD1 (209).
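
For reference, the byte-level difference can be shown in a few lines of Python (a sketch for illustration, not Moqui code). It also suggests why the parser complains about "byte 2 of 2-byte UTF-8 sequence": 0xD1 looks like the first byte of a two-byte sequence, and the "O" that follows is not a valid second byte.

```python
enye = "\u00d1"  # Ñ, code point 209

utf8_bytes = enye.encode("utf-8")      # two bytes: 0xC3 0x91
latin1_bytes = enye.encode("latin-1")  # one byte: 0xD1
char_ref = "&#%d;" % ord(enye)         # the form seen in the PostalAddress export

print(utf8_bytes.hex(), latin1_bytes.hex(), char_ref)  # c391 d1 &#209;

# 0xD1 is 0b11010001. The 110xxxxx prefix marks the start of a 2-byte UTF-8
# sequence, so a decoder expects a continuation byte (10xxxxxx) next.
is_two_byte_lead = (latin1_bytes[0] & 0b11100000) == 0b11000000

# "O" is 0x4F (0b01001111), which is not a continuation byte.
next_is_continuation = (ord("O") & 0b11000000) == 0b10000000

print(is_two_byte_lead, next_is_continuation)  # True False
```

A single 0xD1 byte is how ISO-8859-1 and Windows-1252 store Ñ, which would fit a file written with a non-UTF-8 encoding under a UTF-8 header.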

Maybe I need to chase down why the export decided to use a CDATA section, or why inside one it did not write a character reference?
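
On the second half of that question: XML does not recognize character references inside a CDATA section, so an exporter has no option there except writing the character itself. A small Python check with the standard library parser (not the parser Moqui uses) illustrates this; the element content is a made-up sample.

```python
import xml.etree.ElementTree as ET

# Inside CDATA the text "&#209;" is kept literally; it is not turned into Ñ.
in_cdata = ET.fromstring("<valueText><![CDATA[NI&#209;OS]]></valueText>").text

# In ordinary element text the same reference is resolved to Ñ.
in_text = ET.fromstring("<valueText>NI&#209;OS</valueText>").text

print(repr(in_cdata))  # 'NI&#209;OS'
print(repr(in_text))   # 'NIÑOS'
```

So the missing &#209; inside the CDATA section is expected. The open question is only which bytes the export used for the literal Ñ.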

Is this an xml parsing error?

Ultimately yes, the SAX parser gives up on 0xD1 in the CDATA section.

But I suspect the real question is whether 0xD1 is correct and the SAX parser needs another flag, or whether the snapshot export should be encoding Ñ differently.
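
The two possibilities can be compared with a small experiment. This sketch uses Python's standard library parser rather than the Java SAX parser from the stack trace, so it only illustrates the general behavior of a conforming XML parser; the sample element is made up.

```python
import xml.etree.ElementTree as ET

body = "<valueText><![CDATA[COL NI\u00d1OS HEROES]]></valueText>"

declared_utf8 = b'<?xml version="1.0" encoding="UTF-8"?>'
declared_latin1 = b'<?xml version="1.0" encoding="ISO-8859-1"?>'


def parses(xml_bytes: bytes) -> bool:
    """Return True if the bytes parse as well-formed XML."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


# Header says UTF-8, but Ñ is stored as the single byte 0xD1.
mismatch = declared_utf8 + body.encode("latin-1")
# Header and bytes agree, either way round.
consistent_utf8 = declared_utf8 + body.encode("utf-8")
consistent_latin1 = declared_latin1 + body.encode("latin-1")

print(parses(mismatch), parses(consistent_utf8), parses(consistent_latin1))
# False True True
```

In this experiment the parser only rejects the case where the declaration and the bytes disagree, which points at whatever wrote the file rather than at a missing parser flag.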

For the time being, I changed the data to just be an N instead of Ñ so I could continue to load snapshots from the customer.
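
A less lossy stopgap might be to re-encode the file before loading it instead of replacing the character. This is only a sketch: it assumes the stray bytes are ISO-8859-1 (where 0xD1 is Ñ), which has not been confirmed for this export, and the function name `latin1_to_utf8` is made up.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Return data unchanged if it is already valid UTF-8; otherwise treat it
    as ISO-8859-1 (an assumption) and re-encode it as UTF-8."""
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        return data.decode("latin-1").encode("utf-8")


broken = b"COL NI\xd1OS HEROES"  # hypothetical sample with a raw 0xD1 byte
fixed = latin1_to_utf8(broken)

print(fixed)  # b'COL NI\xc3\x91OS HEROES'
```

The UTF-8 check comes first so that a file which is already correct passes through untouched and is not double-encoded.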

I guess next steps are to see whether it’s a fasterxml (jackson) problem or a moqui problem. It’s probably a fasterxml problem, but could be a moqui one.

If you figure it out, feel free to submit a PR.