[ic] Call for testers

Peter peter at pajamian.dhs.org
Thu Mar 12 20:28:15 UTC 2009

On 03/12/2009 12:30 PM, David Christensen wrote:
> On Mar 12, 2009, at 2:15 PM, Peter wrote:
>> On 03/12/2009 05:32 AM, David Christensen wrote:
>>> <snip>
>>>> One thing which also annoys me is the internal server error caused by
>>>> non UTF-8 characters:
>>>> ZobI6Yf4: - [12/March/2009:09:24:20 +0100] ulisses
>>>> /cgi-bin/ic/ulisses/index Runtime error: Malformed UTF-8 character
>>>> (fatal) at /usr/lib/interchange/Vend/Parser.pm line 112.
>>> What is the text on the index page?  I'm assuming this was in some
>>> legacy encoding and that MV_UTF8 was set to 1.  If MV_UTF8 is off,
>>> this is a bug that should be addressed, as breaking legacy encodings
>>> when MV_UTF8 is off is a Bad Thing.  One of the consequences of
>>> setting MV_UTF8 is that it expects all of your pages, etc to be in
>>> the utf-8 encoding.
>> While this is true, I don't think it's right to bring down a website
>> because a page contains an invalid UTF8 character.  It should be
>> logged as an error and dealt with as gracefully as possible.  One
>> solution is to use the Encode module to convert invalid characters to
>> something like a ? or alternatively to just encode them as (invalid)
>> html entities and push the problem off to the browser.
> Yeah, fatal is a bad result, we could see if there's a more forgiving
> IO layer that can just log those and continue.  I believe most of
> these cases are ushered through Vend::Util::read_file, so we may be
> able to centralize decisions there.

Well, if all we wanted to do was log and continue, then all we would
need is to wrap a few lines of Perl in an eval.  Unfortunately, we also
have to decide how to process the bad text that is causing the problem.
If we just leave it, then (1) large chunks of text will be missing from
the resulting page and (2) it is likely to fail again on the same string
elsewhere.  I think we really need to sanitize the illegal characters
(which may be just one or two chars) out of the text; then we can log
and continue.
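
As a rough illustration of the "sanitize, then log and continue" idea, here
is a minimal sketch using the core Encode module.  The helper name
decode_utf8_leniently and the $logger callback are hypothetical and not
part of Interchange; where such a helper would be called (for example
somewhere near Vend::Util::read_file) is only an assumption.  It tries a
strict decode first and, if that fails, logs the error and decodes again
with Encode's default fallback, which replaces each malformed sequence
with U+FFFD instead of dying.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK FB_DEFAULT LEAVE_SRC);

# Hypothetical helper (not an Interchange function): decode raw bytes as
# UTF-8, logging and sanitizing malformed sequences instead of dying.
sub decode_utf8_leniently {
    my ($bytes, $logger) = @_;
    return undef unless defined $bytes;

    # Strict attempt: FB_CROAK dies on malformed input, and LEAVE_SRC
    # keeps $bytes untouched so it can be decoded again below.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $text if defined $text;

    # Strict decode failed: log the error, then fall back to FB_DEFAULT,
    # which substitutes U+FFFD for each malformed sequence.
    $logger->("Malformed UTF-8 sanitized: $@") if $logger;
    return decode('UTF-8', $bytes, FB_DEFAULT);
}
```

Valid input passes through unchanged and nothing is logged; only the
offending bytes are replaced, so the rest of the page text survives.
Whether U+FFFD, a plain ?, or an HTML entity is the right substitute is
a separate decision.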

