[ic] Call for testers

Fri Mar 13 01:12:43 UTC 2009

On 03/12/2009 02:07 PM, David Christensen wrote:
> On Mar 12, 2009, at 3:28 PM, Peter wrote:

>>>>>> One thing which also annoys me is the internal server error caused by
>>>>>> non UTF-8 characters:
>>>>>>
>>>>>> 127.0.1.1 ZobI6Yf4:127.0.1.1 - [12/March/2009:09:24:20 +0100]
>>>>>> ulisses
>>>>>> /cgi-bin/ic/ulisses/index Runtime error: Malformed UTF-8 character
>>>>>> (fatal)
>>>>>> at /usr/lib/interchange/Vend/Parser.pm line 112.

>> Well, if all we wanted to do was log and continue then all that is
>> needed is to wrap a few lines of perl in an eval.  Unfortunately, we
>> also have to decide how to process the bad text that is causing the
>> problem since if we just leave it then (1) we will have large chunks of
>> text missing of the resulting page as a result and (2) it is likely to
>> fail again on the same string elsewhere.  I think we really need to do
>> something to sanitize the illegal characters (which may just be one or
>> two chars) out of the text, then we can log and continue.
> 
> 
> I can also envision situations where sanitizing the invalid characters  
> might cause problems depending on if it's in a code context rather  
> than just page text.  I'm not sure there's an easy answer here, but  
> welcome to some ideas as how best to proceed.

Well, line 112 of Parser.pm is simply the first place (I think) where
Interchange runs any page text through a regexp.  Presumably if we attempt
to move on from here without first sanitizing the text, we will get the
same error on every other regexp that attempts to parse the page text.
The logical conclusion is that the page will be impossible to parse, so
we will end up with either a blank page (nothing) or the original,
unparsed ITL being spat out at the browser, plus a ton of errors being
recorded to the log.
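
A minimal sketch of the "log and continue" idea, to show why it is not
enough on its own: the error gets logged, but the output of the failed step
is simply lost.  The helper name guarded_step and the tag regexp are
hypothetical and not taken from Interchange, and the failure is simulated
with die() because whether a regexp dies on malformed text depends on the
Perl version.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Interchange): run one parsing step,
# log any fatal error, and return undef instead of dying.
sub guarded_step {
    my ($step, $text, $log) = @_;
    my $result;
    my $ok = eval { $result = $step->($text); 1 };
    unless ($ok) {
        my $err = $@;
        chomp $err;
        push @$log, "Runtime error: $err";
        return;
    }
    return $result;
}

my @log;

# A step that succeeds: collect the tag names found in the text.
my $tags = guarded_step(
    sub { [ $_[0] =~ /\[(\w+)/g ] },
    'Hello [page index]',
    \@log,
);

# A step that fails; the fatal error is simulated (assumption for the demo).
my $lost = guarded_step(
    sub { die "Malformed UTF-8 character (fatal)\n" },
    'bad text',
    \@log,
);
```

After the second call the log has an entry, but $lost is undef, so that
chunk of the page would be missing, and the same text would fail again in
the next step.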

There are only two ways that we can continue and display anything
reasonable, imo.

1.  sanitize the input and proceed as discussed above.

2.  Tell Perl that the text is not UTF-8 encoded by turning off the utf8
flag for the string and then proceeding (you can use the Encode::encode()
function to turn off the utf8 flag).
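
Option 2 might look roughly like the sketch below.  It swaps in
Encode::_utf8_off(), which the Encode documentation marks as internal, for
the Encode::encode() call mentioned above, because it only flips the flag
and does not inspect the octets.  The sample text and the regexp are made up
for illustration, and the malformed input is simulated with
Encode::_utf8_on().

```perl
use strict;
use warnings;
use Encode ();

# Simulated input (an assumption for this demo): Latin-1 octets that have
# wrongly been flagged as UTF-8.  \xE9 on its own is not valid UTF-8.
my $page = "caf\xE9 [page index]";
Encode::_utf8_on($page);

# Option 2: drop the flag so Perl treats the string as plain octets.
Encode::_utf8_off($page) if utf8::is_utf8($page);

# Matching is now byte-wise; the regexp is only an illustration.
my ($target) = $page =~ /\[page\s+(\w+)\]/;
```

The page can then be parsed, but any genuine UTF-8 in the same string is
also treated as raw octets from that point on.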

Personally I would prefer 1, as I think the results will be closer to
what was intended.  I really don't think that many people, if any, will be
embedding binary data in their Interchange pages, and if they do, it
should be encoded as base64 or something similar to prevent this sort of
issue.
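
One possible way to do the sanitizing in option 1, as a sketch only: take
the raw octets and decode them again with Encode::decode() in its default
mode, which substitutes U+FFFD for each malformed sequence.  The helper name
sanitize_utf8 is hypothetical, and the bad input is simulated with
Encode::_utf8_on().

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Interchange): return a copy of $text in
# which malformed UTF-8 sequences are replaced by U+FFFD.
sub sanitize_utf8 {
    my ($text) = @_;
    my $octets = $text;            # work on a copy
    Encode::_utf8_off($octets);    # view the data as raw octets
    # FB_DEFAULT substitutes the replacement character instead of dying.
    return Encode::decode('UTF-8', $octets, Encode::FB_DEFAULT());
}

# Simulated input (an assumption for this demo): a Latin-1 e-acute octet in
# text that has been flagged as UTF-8.
my $bad = "caf\xE9 au lait";
Encode::_utf8_on($bad);

my $clean = sanitize_utf8($bad);

# The sanitized text is well formed, so regexps can run over it safely.
my $matched = $clean =~ /au\s+lait/ ? 1 : 0;
```

Only the one or two offending characters are replaced, so the rest of the
page text survives intact.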

Peter