[ic] Call for testers
david at endpoint.com
Fri Mar 13 02:31:29 UTC 2009
On Mar 12, 2009, at 8:12 PM, Peter wrote:
> On 03/12/2009 02:07 PM, David Christensen wrote:
>> On Mar 12, 2009, at 3:28 PM, Peter wrote:
>>>>>>> One thing which also annoys me is the internal server error
>>>>>>> on non UTF-8 characters:
>>>>>>> 127.0.1.1 ZobI6Yf4:127.0.1.1 - [12/March/2009:09:24:20 +0100]
>>>>>>> /cgi-bin/ic/ulisses/index Runtime error: Malformed UTF-8
>>>>>>> at /usr/lib/interchange/Vend/Parser.pm line 112.
>>> Well, if all we wanted to do was log and continue then all that is
>>> needed is to wrap a few lines of Perl in an eval. Unfortunately, we
>>> also have to decide how to process the bad text that is causing the
>>> problem, since if we just leave it then (1) we will have large chunks
>>> of text missing from the resulting page and (2) it is likely to fail
>>> again on the same string elsewhere. I think we really need to do
>>> something to sanitize the illegal characters (which may just be one
>>> or two chars) out of the text, then we can log and continue.
>> I can also envision situations where sanitizing the invalid characters
>> might cause problems depending on if it's in a code context rather
>> than just page text. I'm not sure there's an easy answer here, but
>> I'd welcome some ideas as to how best to proceed.
> Well, looking at the code, line 112 of Parser.pm is simply the first
> place (I think) where Interchange runs any page text through a regexp.
> Presumably if we attempt to move on from here without first sanitizing
> the text we will get the same error on every other regexp that tries
> to parse the page text. The logical conclusion is that the page will
> be impossible to parse, and so we will either end up with a blank page
> (nothing) or a page that is identical to the original ITL being spat
> out at the browser, and a ton of errors being recorded to the log.
I think what you're saying is that if we try to eval the first regexp
block, we'll continue dying on subsequent regexp attempts. If so, I
agree with the outcome. There may be other options that we can look
at, such as a directive that indicates a fallback encoding of any
existing catalog files, so if we fail on the initial utf-8 decode we
fall back to that. That would allow us to catch and log the fact that
a legacy encoding was encountered, but at the same time would allow us
to properly decode the data in question without resorting to
substitution. I suspect this failure would occur when trying to read
in the data with read_file rather than at regexp match time, but the
solution/logic still holds.
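
To make the idea concrete, here is a rough sketch of the fallback logic.
It is written in Python purely for illustration (Interchange itself is
Perl), and the names decode_with_fallback and legacy_encoding are
hypothetical stand-ins for the proposed directive, not existing
Interchange code:

```python
import logging

log = logging.getLogger("catalog")


def decode_with_fallback(raw, legacy_encoding=None):
    """Decode bytes as strict UTF-8, retrying with a legacy encoding.

    Returns (text, encoding_used). `legacy_encoding` is a hypothetical
    stand-in for the proposed catalog directive.
    """
    try:
        # Strict decode: raises on the first malformed sequence.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError as err:
        if legacy_encoding is None:
            # No fallback configured, so the failure propagates.
            raise
        # Record that legacy data was encountered, then decode it properly.
        log.warning("invalid UTF-8 at byte %d; retrying as %s",
                    err.start, legacy_encoding)
        return raw.decode(legacy_encoding), legacy_encoding
```

The point of decoding strictly first is that valid UTF-8 files are never
touched by the fallback, and no characters are substituted or dropped.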
> There are only two ways that we can continue and display anything
> reasonable, imo.
> 1. sanitize the input and proceed as discussed above.
> 2. Tell Perl that the text is not UTF8 encoded by turning off the
> flag for the string and then proceeding (you use the Encode::encode()
> function to turn off the utf8 flag).
I think the abovementioned directive solves both issues; if MV_UTF8 is
off and/or the legacy encoding is not defined, we could fall back to
raw octets, like in 2.
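
Roughly, the decision would look like the sketch below. Again this is
Python for illustration only; load_page, utf8_enabled (standing in for
MV_UTF8) and legacy_encoding are hypothetical names, and returning bytes
is just this sketch's way of representing "raw octets":

```python
def load_page(raw, utf8_enabled, legacy_encoding=None):
    """Return decoded text where possible, otherwise the raw octets."""
    if not utf8_enabled:
        # UTF-8 handling is off: hand back the octets untouched.
        return raw
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        if legacy_encoding is None:
            # No legacy encoding defined: fall back to raw octets.
            return raw
        return raw.decode(legacy_encoding)
```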
> Personally I would prefer 1 as I think the results will be closer to
> intended. I really don't think that many people, if anyone, will be
> embedding binary data on their Interchange pages, and if they do it
> should be encoded as base64 or something to prevent this sort of
Now that I think about it, I agree on the likelihood of binary data.
I was thinking more of *output* that was binary in a specific format,
rather than input being targeted by the parser again.
End Point Corporation
david at endpoint.com