[ic] Call for testers

David Christensen david at endpoint.com
Sat Mar 14 02:22:24 UTC 2009

On Mar 13, 2009, at 5:56 PM, Peter wrote:

> On 03/13/2009 06:09 AM, David Christensen wrote:
>> On Mar 13, 2009, at 4:29 AM, Peter wrote:
>>>> and if it's enabled, see any invalid UTF-8 bytes converted to ?
>>>> characters. That's simple, nonfatal at runtime, and yet gently
>>>> encourages developers to get their sources in the proper UTF-8
>>>> encoding.
>>> I'm fine with that, and that was the original proposal.  One problem,
>>> though, is that while I thought that the Encode module could do that,
>>> apparently it can only barf when decoding unicode input, so we would
>>> have to find another way to find the invalid chars and change them
>>> over.
>> There is a third param to Encode::decode which specifies the behavior
>> of invalid decodes, which by default is to die, but can warn, ignore
>> or silently substitute IIRC.  So I think this could be made to
>> substitute the invalid character marker without much problem.
> Yes, you're referring to the CHECK parameter which, unfortunately, works
> for every encoding type *except* unicode.
> http://search.cpan.org/~dankogai/Encode-2.32/Encode.pm#Handling_Malformed_Data
> NOTE: Not all encoding support this feature
>    Some encodings ignore CHECK argument. For example, Encode::Unicode
> ignores CHECK and it always croaks on error.

Here's a little test script I wrote which turns invalid bytes into
their \x counterparts (i.e., a literal ASCII representation of the hex
value) by passing FB_PERLQQ as the CHECK param.  You can see that it
properly decodes and re-encodes valid utf-8 sequences, but anything it
is unable to handle gets turned into the corresponding hex escape.  To
my mind this makes it more obvious that something odd is going on, and
it still prevents things from blowing up while allowing the full range
of properly encoded unicode characters.


   #!/usr/bin/env perl

   use strict;
   use warnings;

   use Encode;

   print "--------\n";
   print "Encode test script\n";
   print "Encode module version: $Encode::VERSION\n";

   my $cp1252_octets = "I can\222t believe it\222s not utf-8!";
   my $utf8_octets   = "It doesn't make \xC2\xA2\xC2\xA2!";

   for my $octets ($cp1252_octets, $utf8_octets) {
       print "--------\n";
       printf "Length of original string: %d\n", length $octets;

       # FB_PERLQQ turns each undecodable byte into a literal \xHH escape
       my $string = eval {
           decode('utf8', $octets, Encode::FB_PERLQQ);
       };

       warn "Died in utf-8 decode: $@\n" if $@;

       print "Original string: $octets\n";
       print "Decoded string (internal perl representation): $string\n";
       printf "Encoded utf-8 output string: %s\n", encode_utf8($string);
       printf "Length of decoded string: %d\n", length $string;
   }

The output on my machine from said script:

   oy:~ machack$ perl utf8_test.pl
   Encode test script
   Encode module version: 2.23
   Length of original string: 31
   Original string: I can\222t believe it\222s not utf-8!
   Decoded string (internal perl representation): I can\x92t believe it\x92s not utf-8!
   Encoded utf-8 output string: I can\x92t believe it\x92s not utf-8!
   Length of decoded string: 37
   Length of original string: 21
   Original string: It doesn't make ¢¢!
   Decoded string (internal perl representation): It doesn't make  
   Encoded utf-8 output string: It doesn't make ¢¢!
   Length of decoded string: 19
   oy:~ machack$
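
For comparison, the "?"-style substitution from the original proposal
can probably be had the same way by changing the CHECK value.  What
follows is only a minimal sketch, assuming the strict 'UTF-8' encoding
name and FB_DEFAULT, which the Encode documentation describes as
substituting U+FFFD (REPLACEMENT CHARACTER) for malformed data on
decode.  The helper name decode_lenient is made up for illustration
and is not part of Interchange or Encode.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);

# Hypothetical helper, not taken from Interchange: decode octets as UTF-8
# without dying on bad input.
sub decode_lenient {
    my ($octets) = @_;
    # With CHECK set to FB_DEFAULT, decode() replaces malformed data with
    # U+FFFD (REPLACEMENT CHARACTER) rather than croaking.
    return decode('UTF-8', $octets, Encode::FB_DEFAULT());
}

for my $octets ("I can\222t believe it\222s not utf-8!",
                "It doesn't make \xC2\xA2\xC2\xA2!") {
    my $string = decode_lenient($octets);
    my $bad    = () = $string =~ /\x{FFFD}/g;   # count substituted markers
    printf "%d replacement(s): %s\n", $bad, encode_utf8($string);
}
```

This should report two replacements for the cp1252 string and none for
the valid utf-8 one.  The trade-off against FB_PERLQQ is that the
marker no longer tells you which byte was bad, only that one was.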


David Christensen
End Point Corporation
david at endpoint.com
