[ic] Call for testers

Thu Mar 12 21:07:49 UTC 2009

On Mar 12, 2009, at 3:28 PM, Peter wrote:

> On 03/12/2009 12:30 PM, David Christensen wrote:
>> On Mar 12, 2009, at 2:15 PM, Peter wrote:
>>
>>> On 03/12/2009 05:32 AM, David Christensen wrote:
>>>> <snip>
>>>>
>>>>> One thing which also annoys me is the internal server error caused
>>>>> by non UTF-8 characters:
>>>>>
>>>>> 127.0.1.1 ZobI6Yf4:127.0.1.1 - [12/March/2009:09:24:20 +0100] ulisses /cgi-bin/ic/ulisses/index Runtime error: Malformed UTF-8 character (fatal) at /usr/lib/interchange/Vend/Parser.pm line 112.
>>>> What is the text on the index page?  I'm assuming this was in some
>>>> legacy encoding and that MV_UTF8 was set to 1.  If MV_UTF8 is off,
>>>> this is a bug that should be addressed, as breaking legacy encodings
>>>> when MV_UTF8 is off is a Bad Thing.  One of the consequences of
>>>> setting MV_UTF8 is that it expects all of your pages, etc to be in
>>>> the utf-8 encoding.
>>> While this is true, I don't think it's right to bring down a website
>>> because a page contains an invalid UTF8 character.  It should be
>>> logged as an error and dealt with as gracefully as possible.  One
>>> solution is to use the Encode module to convert invalid characters to
>>> something like a ? or alternatively to just encode them as (invalid)
>>> html entities and push the problem off to the browser.
>>
>> Yeah, fatal is a bad result, we could see if there's a more forgiving
>> IO layer that can just log those and continue.  I believe most of
>> these cases are ushered through Vend::Util::read_file, so we may be
>> able to centralize decisions there.
>
> Well, if all we wanted to do was log and continue then all that is
> needed is to wrap a few lines of perl in an eval.  Unfortunately, we
> also have to decide how to process the bad text that is causing the
> problem since if we just leave it then (1) we will have large chunks
> of text missing from the resulting page and (2) it is likely to
> fail again on the same string elsewhere.  I think we really need to do
> something to sanitize the illegal characters (which may just be one or
> two chars) out of the text, then we can log and continue.

I can also envision situations where sanitizing the invalid characters
might cause problems, depending on whether they appear in a code context
rather than just page text.  I'm not sure there's an easy answer here, but
I'd welcome ideas on how best to proceed.
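
As a starting point for discussion, here is a minimal sketch of the
"log, sanitize, and continue" idea using the Encode module.  The helper
name decode_utf8_lenient and its $label argument are hypothetical and
not part of Interchange; I am only assuming something like this could be
called from wherever file contents are read.  It tries a strict decode
inside an eval and, if that fails, logs a warning and decodes again with
Encode's default fallback, which substitutes U+FFFD for malformed bytes.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not part of Interchange: decode a byte string as
# UTF-8, falling back to a lenient decode instead of dying.
sub decode_utf8_lenient {
    my ($bytes, $label) = @_;
    return undef unless defined $bytes;
    $label = 'input' unless defined $label;

    # Strict attempt first.  Work on a copy, because Encode may modify
    # its source argument when a CHECK value such as FB_CROAK is given.
    my $copy = $bytes;
    my $text = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $text if defined $text;

    # Log the problem and continue rather than raising a fatal error.
    warn "Malformed UTF-8 in $label; substituting U+FFFD\n";

    # FB_DEFAULT replaces each malformed sequence with U+FFFD.
    $copy = $bytes;
    return Encode::decode('UTF-8', $copy, Encode::FB_DEFAULT);
}
```

This only addresses plain page text.  In a code context a substituted
character could still change behavior, so it does not settle that
question.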

Regards,

David
--
David Christensen
End Point Corporation
david at endpoint.com
212-929-6923
http://www.endpoint.com/