[ic] Call for testers

David Christensen david at endpoint.com
Fri Mar 13 02:31:29 UTC 2009

On Mar 12, 2009, at 8:12 PM, Peter wrote:

> On 03/12/2009 02:07 PM, David Christensen wrote:
>> On Mar 12, 2009, at 3:28 PM, Peter wrote:
>>>>>>> One thing which also annoys me is the internal server error caused
>>>>>>> by non UTF-8 characters:
>>>>>>> ZobI6Yf4: - [12/March/2009:09:24:20 +0100] ulisses
>>>>>>> /cgi-bin/ic/ulisses/index Runtime error: Malformed UTF-8 character
>>>>>>> (fatal) at /usr/lib/interchange/Vend/Parser.pm line 112.
>>> Well, if all we wanted to do was log and continue, then all that is
>>> needed is to wrap a few lines of perl in an eval.  Unfortunately, we
>>> also have to decide how to process the bad text that is causing the
>>> problem, since if we just leave it then (1) large chunks of text will
>>> be missing from the resulting page and (2) it is likely to fail again
>>> on the same string elsewhere.  I think we really need to do something
>>> to sanitize the illegal characters (which may just be one or two
>>> chars) out of the text; then we can log and continue.
>> I can also envision situations where sanitizing the invalid characters
>> might cause problems, depending on whether they are in a code context
>> rather than just page text.  I'm not sure there's an easy answer here,
>> but I'd welcome ideas on how best to proceed.
> Well, looking at the code, line 112 of Parser.pm is simply the first
> place (I think) where Interchange runs any page text through a regexp.
> Presumably if we attempt to move on from here without first sanitizing
> the text, we will get the same error on every other regexp that attempts
> to parse the page text.  The logical conclusion is that the page will be
> impossible to parse, so we will end up with either a blank page
> (nothing) or a page identical to the original ITL being spat out at the
> browser, and a ton of errors being recorded to the log.

I think what you're saying is that if we try to eval the first regexp
block, we'll continue dying on subsequent regexp attempts.  If so, I
agree with the outcome.  There may be other options that we can look
at, such as a directive that indicates a fallback encoding for any
existing catalog files, so that if the initial utf-8 decode fails we
fall back to that.  That would allow us to catch and log the fact that
a legacy encoding was encountered, but at the same time would allow us
to properly decode the data in question without resorting to
substitution.  I suspect this failure would occur when trying to read
in the data with read_file rather than at regexp match time, but the
solution/logic still holds.
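
A rough sketch of that fallback logic, for illustration only.  The helper
name decode_with_fallback, its arguments and its return convention are
assumptions made up for this example; no such directive or function
exists in Interchange as far as this thread shows.  It only relies on
Encode::decode() with the FB_CROAK and LEAVE_SRC check flags:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: try strict UTF-8 first, then a caller-supplied
# legacy encoding, and finally hand back the raw octets untouched.
# Returns the text plus a label saying which path was taken, so the
# caller can log that a legacy encoding was encountered.
sub decode_with_fallback {
    my ($octets, $fallback) = @_;

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # $octets intact so it can be reused for the fallback attempt.
    my $text = eval {
        Encode::decode('UTF-8', $octets,
            Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return ($text, 'UTF-8') if defined $text;

    # Strict decode failed: retry with the configured legacy encoding.
    if (defined $fallback and length $fallback) {
        return (Encode::decode($fallback, $octets), $fallback);
    }

    # No fallback configured: treat the data as raw octets.
    return ($octets, 'raw');
}
```

Called as decode_with_fallback($data, 'iso-8859-1'), this would return
the decoded text together with either 'UTF-8' or 'iso-8859-1'; with no
second argument it degrades to raw octets.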

> There are only two ways that we can continue and display anything
> reasonable, imo.
> 1.  Sanitize the input and proceed as discussed above.
> 2.  Tell Perl that the text is not UTF-8 encoded by turning off the utf8
> flag for the string and then proceeding (you use the Encode::encode()
> function to turn off the utf8 flag).

I think the above-mentioned directive solves both issues; if MV_UTF8 is
off and/or the legacy encoding is not defined, we could fall back to
raw octets, as in 2.

> Personally I would prefer 1, as I think the results will be closer to
> intended.  I really don't think that many people, if any, will be
> embedding binary data on their Interchange pages, and if they do it
> should be encoded as base64 or something to prevent this sort of issue.

Now that I think about it, I agree on the likelihood of binary data.
I was thinking more of *output* that was binary in a specific format,
rather than input being targeted by the parser again.
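
For comparison, the sanitizing approach (option 1) could look roughly
like the following.  Again this is only a sketch: sanitize_utf8 is a
hypothetical name, and replacing bad sequences with U+FFFD is just one
possible policy (stripping them would be another).  It relies on the
default behaviour of Encode::decode(), which substitutes U+FFFD for each
malformed sequence when no CHECK argument is given:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode octets as UTF-8, letting Encode replace
# every malformed sequence with U+FFFD instead of dying.
sub sanitize_utf8 {
    my ($octets) = @_;

    # Default CHECK (FB_DEFAULT): malformed data becomes U+FFFD.
    my $text = Encode::decode('UTF-8', $octets);

    # Count replacement characters so the caller can log the problem.
    # Note this also counts any U+FFFD legitimately present in the input.
    my $replaced = () = $text =~ /\x{FFFD}/g;

    return ($text, $replaced);
}
```

A non-zero count would be the cue to write a log entry and carry on with
the cleaned text.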

> Peter
> _______________________________________________
> interchange-users mailing list
> interchange-users at icdevgroup.org
> http://www.icdevgroup.org/mailman/listinfo/interchange-users


David Christensen
End Point Corporation
david at endpoint.com
