Discussion:
XMLParser Claims U+00A0 is “Invalid UTF-8”
Sean P. DeNigris
2016-07-28 19:12:04 UTC
Posted to StackOverflow
(https://stackoverflow.com/questions/38645553/xmlparser-in-pharo-claims-u00a0-is-invalid-utf-8):



Given the input:

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<sms body=". what" />

Where the character after the "." in the body attribute of the sms tag is
U+00A0;

I get the error:

XMLEncodingException: Invalid UTF-8 character encoding (line 2) (column 13)

IIUC, the UTF-8 representation of that character is 0xC2 0xA0 per Wikipedia.
Sure enough, bytes 72 and 73 of the input are 194 and 160 respectively.

This seems like a bug in XMLParser, or am I missing something?




-----
Cheers,
Sean
--
View this message in context: http://forum.world.st/XMLParser-Claims-U-00A0-is-Invalid-UTF-8-tp4908525.html
Sent from the Pharo Smalltalk Users mailing list archive at Nabble.com.
monty
2016-07-28 20:40:36 UTC
Just to be sure, I manually recreated your file (with the great Bless hex editor) and parsed it with no issue.

Please post your code and attach the actual source as a file separately.

Sean P. DeNigris
2016-07-28 20:05:40 UTC
monty-3 wrote
> Just to be sure, I manually recreated your file (with the great Bless hex
> editor) and parsed it with no issue.

Thanks!


monty-3 wrote
> Please post your code and attach the actual source as a file separately.

The code is merely:
messageLog := FileLocator home / 'illegal-UTF-sms.xml'.
doc := XMLDOMParser parse: messageLog.

File: illegal-UTF-sms.xml
<http://forum.world.st/file/n4908531/illegal-UTF-sms.xml>



-----
Cheers,
Sean
--
View this message in context: http://forum.world.st/XMLParser-Claims-U-00A0-is-Invalid-UTF-8-tp4908525p4908531.html
Sent from the Pharo Smalltalk Users mailing list archive at Nabble.com.
Sven Van Caekenberghe
2016-07-28 21:04:30 UTC
Sean,

Your XML file is not UTF-8 encoded; it is plain Unicode, at least the way it is served from the URL you gave.

(('http://forum.world.st/file/n4908531/illegal-UTF-sms.xml' asUrl retrieveContents) at: 72 ) = 160 asCharacter.

"true"

Like you said,

160 asCharacter asString utf8Encoded.

"#[194 160]"

But

#[ 160 ] utf8Decoded.

Boom!

You specify UTF-8 encoding inside your XML, so I assume the parser then switches to that encoding, but your pure Unicode content is not UTF-8 encoded, which results in an exception. You see?
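
The two byte-level facts above can be checked outside Pharo as well. The following is a minimal sketch in Python (not Smalltalk), using only the standard string and bytes codecs; the variable names are illustrative and not from the thread.

```python
# U+00A0 (NO-BREAK SPACE) as a one-character string.
nbsp = "\u00a0"

# Encoding it as UTF-8 yields the two bytes 0xC2 0xA0 (194, 160).
encoded = nbsp.encode("utf-8")
print(list(encoded))

# The bare byte 0xA0 is a UTF-8 continuation byte. With no lead byte
# in front of it, a strict UTF-8 decoder has to reject it.
try:
    b"\xa0".decode("utf-8")
    lone_byte_decodes = True
except UnicodeDecodeError:
    lone_byte_decodes = False
print(lone_byte_decodes)
```

So a decoder that is handed the already-decoded character code 160 instead of the original byte pair will fail in the same way as `#[ 160 ] utf8Decoded`.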

Sven

Sean P. DeNigris
2016-07-28 20:29:21 UTC
Sven Van Caekenberghe-2 wrote
> Your XML file is not UTF-8 encoded, it is plain Unicode. At least the way
> it is served from the URL you gave.
> ..
> You see ?

Unfortunately, no! ha ha. I didn't generate the file and I took its
assertion that it was UTF-8 at face value. How do I properly feed the file
into XMLParser?



-----
Cheers,
Sean
--
View this message in context: http://forum.world.st/XMLParser-Claims-U-00A0-is-Invalid-UTF-8-tp4908525p4908539.html
Sent from the Pharo Smalltalk Users mailing list archive at Nabble.com.
Sven Van Caekenberghe
2016-07-28 21:29:26 UTC
In my older work image, the following just works:

XMLDOMParser parse:
('http://forum.world.st/file/n4908531/illegal-UTF-sms.xml' asUrl retrieveContents).

But I guess that is because my (older) XML parser version ignores the encoding, or is more lenient.

You could try to edit the incoming file, or have a look at #decodesCharacters:

(XMLDOMParser on:
('http://forum.world.st/file/n4908531/illegal-UTF-sms.xml' asUrl retrieveContents) readStream) decodesCharacters: false; parseDocument.

But I am no expert in the deeper aspects of XML Support.

monty
2016-07-28 22:15:57 UTC
Good job finding one of the fixes, but please use #parseURL:/#onURL: instead of #asUrl/#asZnUrl with #retrieveContents, because the latter results in Zinc eagerly decoding the response without first looking at the <?xml ?> declaration, which the XML spec requires.

#parseURL:/#onURL: use Zinc correctly, doing their own XML-aware encoding on top of it.
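
To illustrate what "looking at the <?xml ?> declaration" before decoding means, here is a rough sketch in Python (not Smalltalk). It is not how XMLParser or Zinc implement it: `declared_encoding` and `decode_xml` are hypothetical helpers, the regular expression is a simplification, and BOM and UTF-16/UTF-32 detection (which the XML spec also describes) are deliberately left out.

```python
import re

# Simplified pattern for the encoding pseudo-attribute of an XML declaration.
# Assumption: the document starts with an ASCII-compatible "<?xml ...?>".
_DECLARATION = re.compile(
    rb"<\?xml[^>]*?encoding=['\"]([A-Za-z][A-Za-z0-9._-]*)['\"]"
)


def declared_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or the default."""
    match = _DECLARATION.match(raw)
    return match.group(1).decode("ascii") if match else default


def decode_xml(raw: bytes) -> str:
    """Decode the raw bytes exactly once, using the declared encoding."""
    return raw.decode(declared_encoding(raw))


latin1_doc = b"<?xml version='1.0' encoding='ISO-8859-1'?><sms body='.\xa0what'/>"
print(declared_encoding(latin1_doc))
print("\u00a0" in decode_xml(latin1_doc))
```

The point of the sketch is the order of operations: the bytes are inspected first and decoded once, rather than being decoded eagerly by the HTTP layer and then decoded again by the parser.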

monty
2016-07-28 22:23:06 UTC
Also, #parseURL:/#onURL: will use WebClient on Squeak (unless Zinc is present, of course).

Sven Van Caekenberghe
2016-07-28 22:45:10 UTC
> On 29 Jul 2016, at 00:15, monty <***@programmer.net> wrote:
>
> Good for finding one of the fixes, but please use #parseURL:/#onURL: instead of #asUrl/#asZnUrl with #retrieveContents, because that will result in Zinc eagerly decoding the response without looking at the <?xml ?> declaration as the XML spec requires.
>
> #parseURL:/#onURL: use Zinc correctly, doing their own XML-aware encoding on top of it.

Yes, you are right. Thanks for implementing all this logic, I know it is quite complicated and tricky.

monty
2016-07-28 21:44:26 UTC
You're double decoding. Use onFileNamed:/parseFileNamed: instead (and the DOM printToFileNamed: family of messages when writing) and let XMLParser take care of this for you, or disable XMLParser decoding before parsing with #decodesCharacters:.

Longer explanation:

The class-side #on:/#parse: methods take either a string or a stream (read the definitions). You gave it a FileReference, but because the argument is tested with isString and sent #readStream otherwise, it didn't blow up then.

File refs sent #readStream return file streams that do automatic decoding. But XMLParser automatically attempts its own decoding too, if:

The input starts with a BOM, or the encoding can be inferred from null bytes before or after the first non-null byte.

There is an encoding declaration with a non-UTF-8 encoding.

There is a UTF-8 encoding declaration but the stream is not a normal ReadStream (your case).

So it gets decoded twice, and the decoded value of the char causes the error. I'll consider changing the heuristic to make it less eager to decode.
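
The double decoding described above can be reproduced in a few lines. This is a sketch in Python (not Smalltalk) under one stated assumption: the second decoder is modeled as treating each already-decoded character code as if it were a raw byte, which is done here with a Latin-1 round trip. That modeling step is an illustration, not XMLParser's actual code.

```python
# Line 2 of the file from the thread, as correctly UTF-8 encoded bytes.
raw = '<sms body=".\u00a0what" />'.encode("utf-8")

# First decoding (the file stream's job): 0xC2 0xA0 becomes the single
# character U+00A0, whose code is 160.
once = raw.decode("utf-8")

# Assumption: the second decoder reads character codes as bytes. Latin-1
# maps code points 0..255 to the same byte values, so it models that step.
reinterpreted = once.encode("latin-1")

# Second decoding: the byte 0xA0 now stands alone, with no 0xC2 lead byte.
try:
    reinterpreted.decode("utf-8")
    error_column = None
except UnicodeDecodeError as err:
    error_column = err.start + 1  # convert 0-based offset to 1-based column

print(error_column)
```

Under this model the failure lands on the 13th character of the line, which is consistent with the "(line 2) (column 13)" position in the reported XMLEncodingException.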

Sean P. DeNigris
2016-07-28 21:13:16 UTC
monty-3 wrote
> You're double decoding

And in public, no less! Thanks. It works now with #parseFileNamed:. Minus
side - half a day wasted; plus side - I wrote a compatibility layer for
Magritte-XMLBinding to accept SoupTags to #fromXmlNode:



-----
Cheers,
Sean
--
View this message in context: http://forum.world.st/XMLParser-Claims-U-00A0-is-Invalid-UTF-8-tp4908525p4908555.html
Sent from the Pharo Smalltalk Users mailing list archive at Nabble.com.