[ic] Call for testers
david at endpoint.com
Thu Mar 12 21:07:49 UTC 2009
On Mar 12, 2009, at 3:28 PM, Peter wrote:
> On 03/12/2009 12:30 PM, David Christensen wrote:
>> On Mar 12, 2009, at 2:15 PM, Peter wrote:
>>> On 03/12/2009 05:32 AM, David Christensen wrote:
>>>>> One thing which also annoys me is the internal server error caused by
>>>>> non UTF-8 characters:
>>>>> 127.0.1.1 ZobI6Yf4:127.0.1.1 - [12/March/2009:09:24:20 +0100]
>>>>> /cgi-bin/ic/ulisses/index Runtime error: Malformed UTF-8 character
>>>>> at /usr/lib/interchange/Vend/Parser.pm line 112.
>>>> What is the text on the index page? I'm assuming this was in some
>>>> legacy encoding and that MV_UTF8 was set to 1. If MV_UTF8 is off,
>>>> this is a bug that should be addressed, as breaking legacy behavior
>>>> when MV_UTF8 is off is a Bad Thing. One of the consequences of
>>>> setting MV_UTF8 is that it expects all of your pages, etc to be in
>>>> utf-8 encoding.
>>> While this is true, I don't think it's right to bring down a website
>>> because a page contains an invalid UTF-8 character. It should be logged
>>> as an error and dealt with as gracefully as possible. One solution is
>>> to use the Encode module to convert invalid characters to something
>>> like a ? or alternatively to just encode them as (invalid) html entities
>>> and push the problem off to the browser.
>> Yeah, fatal is a bad result, we could see if there's a more forgiving
>> IO layer that can just log those and continue. I believe most of
>> these cases are ushered through Vend::Util::read_file, so we may be
>> able to centralize decisions there.
> Well, if all we wanted to do was log and continue then all that is
> needed is to wrap a few lines of perl in an eval. Unfortunately, we
> also have to decide how to process the bad text that is causing the
> problem since if we just leave it then (1) we will have large chunks of
> text missing from the resulting page as a result and (2) it is likely to
> fail again on the same string elsewhere. I think we really need to do
> something to sanitize the illegal characters (which may just be one or
> two chars) out of the text, then we can log and continue.
I can also envision situations where sanitizing the invalid characters
might cause problems, depending on whether they appear in a code context
rather than just page text. I'm not sure there's an easy answer here, but
I'd welcome some ideas on how best to proceed.
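
For discussion, a minimal sketch of the Encode-based sanitizing idea
quoted above might look like the following. The helper name
decode_utf8_lenient and its $as_entity flag are made up for illustration
and are not part of Interchange, and where such a call would hook into
Vend::Util::read_file is left open. The sketch assumes an Encode version
that accepts a code reference as the CHECK argument to decode, in which
case the callback receives the ordinal of each malformed byte.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Interchange): decode bytes as UTF-8
# without dying on malformed input. Returns the decoded text and a count
# of the malformed bytes that had to be replaced.
sub decode_utf8_lenient {
    my ($bytes, $as_entity) = @_;
    my $bad = 0;

    # Encode calls this fallback with the ordinal of each malformed byte
    # and splices the returned octets in its place.
    my $fallback = sub {
        my $ord = shift;
        $bad++;
        return $as_entity ? sprintf('&#%d;', $ord) : '?';
    };

    my $text = decode('UTF-8', $bytes, $fallback);
    return ($text, $bad);
}

# A lone Latin-1 "e acute" byte (0xE9) is not valid UTF-8.
my ($text, $bad) = decode_utf8_lenient("caf\xE9 au lait");
warn "replaced $bad malformed byte(s)\n" if $bad;   # log and continue
print "$text\n";
```

Because the fallback never dies, no eval is needed around the decode, and
the count gives something to log. This only covers page text; it does
nothing to settle the code-context concern raised above.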
End Point Corporation
david at endpoint.com