[ic] Call for testers

Peter peter at pajamian.dhs.org
Fri Mar 13 02:55:11 UTC 2009

On 03/12/2009 07:31 PM, David Christensen wrote:
> There may be other options that we can look  
> at, such as a directive that indicates a fallback encoding of any  
> existing catalog files, so if we fail on the initial utf-8 decode we  
> fall back to that.  That would allow us to catch and log the fact that  
> a legacy encoding was encountered, but at the same time would allow us  
> to properly decode the data in question without resorting to  
> substitution.  I suspect this failure would occur when trying to read  
> in the data with read_file rather than at regexp match time, but the  
> solution/logic still holds.

That sounds like a good idea.  Put simply, we eval in a couple of
different places, and if the eval fails we can assume that the data is
Latin-1 and convert it to UTF-8 on that basis.  Perl should then have
valid UTF-8 and stop complaining, and the data will (we hope) look like
it's supposed to.  It might also be useful to have a directive that
indicates what the page encoding is; then we can dispense with the eval,
assume that all pages are encoded as per the directive, and convert to
UTF-8.  This would happen in read_file.  For backwards compatibility I
think it would be best to ignore this directive if MV_UTF8 is not set.
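
To make the order of the checks concrete, here is a rough sketch of the
logic described above.  It is written in Python purely as illustration
and is not Interchange code: decode_page, utf8_enabled, page_encoding
and fallback are hypothetical names standing in for read_file, MV_UTF8,
the proposed page-encoding directive and the assumed Latin-1 fallback.

```python
import logging
from typing import Optional, Tuple, Union

log = logging.getLogger("decode_sketch")


def decode_page(
    raw: bytes,
    utf8_enabled: bool,
    page_encoding: Optional[str] = None,
    fallback: str = "latin-1",
) -> Tuple[Union[str, bytes], Optional[str]]:
    """Return (data, encoding_used) for one page read from disk.

    All parameter names are hypothetical stand-ins:
      utf8_enabled  ~ MV_UTF8 being set
      page_encoding ~ the proposed page-encoding directive
      fallback      ~ the legacy encoding assumed when UTF-8 decoding fails
    """
    # MV_UTF8 off: leave the data as raw octets and ignore the directive,
    # so existing catalogs behave exactly as before.
    if not utf8_enabled:
        return raw, None

    # Directive set: skip the trial decode and trust the declared encoding.
    if page_encoding is not None:
        return raw.decode(page_encoding), page_encoding

    # No directive: try UTF-8 first (the "eval"), fall back on failure.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        log.warning("legacy encoding encountered; decoding as %s", fallback)
        return raw.decode(fallback), fallback
```

With this sketch, the octets b"caf\xe9" fail the UTF-8 attempt and are
decoded as Latin-1, while the same octets pass through untouched when
utf8_enabled is false.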

> I think the abovementioned directive solves both issues; if MV_UTF8 is  
> off and/or the legacy encoding is not defined, we could fall back to  
> raw octets, like in 2.

If MV_UTF8 is off this is not an issue, since Perl will treat everything
as raw octets anyway, and we should not try to change anything or we
risk breaking backwards compatibility.

As a side note, I'm now thinking that making MV_UTF8 a variable may have
been a mistake.  I would much rather see it as a configuration
directive.  Same goes for MV_HTTP_CHARSET.  I wonder if it's too late to
change that?

