[ic] Non-US keys = UTF-8 issue?

Wed Feb 27 20:58:05 EST 2008

Grant wrote:
>>  >>>> Email order receipt will not be sent as UTF8 charset, so it's quite
>>  >>>> plausible that Swedish characters are messed up. Proper UTF8 support
>>  >>>> is still under development.
>>  >>>>
>>  >>>> Regards
>>  >>>>          Racke
>>  >>> Will IC pass unicode characters properly to mysql?  Should they be
>>  >>> displayed properly with [value]?
>>  >> As has been noted in this thread already, full unicode support is far
>>  >> from trivial, and is something that can be difficult to put in as an
>>  >> afterthought.  If you are just concerned with the out-going emails
>>  >> (i.e., the site appears to function fine), you can try to use one of
>>  >> the following approaches:
>>  >>
>>  >> If you are using the [email] tag to send out your confirmation/order
>>  >> emails and you know that all of the data will be in the UTF-8
>>  >> encoding, you can add explicit calls to the tag usertag to output
>>  >> mime headers as shown:
>>  >>
>>  >> [email <to, from, etc> extra="[tag op=mime arg=header]"]
>>  >> [tag op='mime' type='text/plain; charset="utf-8"']
>>  >> <body content here>
>>  >> [/email]
>>  >>
>>  >> Another option (depending on how much you want to get your hands
>>  >> dirty) is to roll-your-own email sending usertag/routine in Perl
>>  >> which can harness both Encode and MIME::Lite to explicitly manage/
>>  >> handle the coercion of data to the desired encoding.
>>  >>
>>  >> Please note that if you have non-ascii data that you want to appear
>>  >> in the email headers (to, from, subject, etc) you will need to
>>  >> explicitly encode the data using the MIME-Header encoding to handle
>>  >> this properly.
>>  >>
>>  >> Good Luck,
>>  >>
>>  >> David
>>  >
>>  > Thanks David.  I'm not so much concerned with email being displayed
>>  > properly as I am with having the customer's shipping address.  Maybe
>>  > the thing to do is use [tag] as you suggested to always send a
>>  > separate UTF-8 email to the admin containing just the shipping address
>>  > so we're sure to have that.  We would need to run that UTF-8 address
>>  > through IC to ship though, so that may not do any good anyway.  It
>>  > sounds like UTF-8 data is messed up as soon as it hits IC, but maybe
>>  > not.  I'm still not clear on that.
>>
>>  Check if UTF8 data is stored as such in the database, try to enter
>>  UTF8 strings in user account forms etc.
>>
>>
>>  Regards
>>           Racke
>>     
>
> Alright, thanks for everyone's help with this.
>   

Grant,

Did you sort out the last matter you mentioned, regarding getting UTF-8 
data into MySQL?

IIRC, Interchange doesn't do much of anything with the incoming data 
(for a POST or whatever) as far as encoding is concerned; it simply 
assumes raw encoding on the filehandle between Interchange and the 
vlink/tlink script.

I believe this can work, provided that:
* the actual web pages themselves, and the forms therein, are properly 
encoded with UTF-8, marked as such, and thus the browser submits data in 
UTF-8;
* the client encoding on your DBD::mysql connection is set to raw, or 
whatever MySQL's equivalent encoding name for this is (I cannot 
remember; I seem to recall that MySQL may treat the latin1 encoding as 
simple raw encoding, in which case it wouldn't make a difference -- I 
moved to Postgres when I started dealing with any real UTF-8 data).

This is all just treating it as raw data, which isn't necessarily 
ideal.  For one, if the data is coming in as raw byte strings (as 
outlined above), then regexes will give you funky behavior (for 
instance, the HTML entity encoding routines will appear to break your 
data).  This is because in a raw string, each character represents an 
octet rather than an actual character, but Perl has no way of knowing 
that.  So, what is in fact a valid high-bit sequence in UTF-8 (for 
representing any character outside the 7-bit ASCII range) will appear as 
a series of odd characters in the raw string if you were to simply 
print the raw string to a non-UTF8 terminal.  In order for regexes to 
work reliably, the raw data needs to be decoded into a UTF8 character 
scalar, which requires messing with the Perl Encode module.
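
A minimal sketch of the byte-versus-character problem described above, 
assuming the browser submitted the single character "å" as UTF-8.  The 
entity_escape routine is a hypothetical stand-in for an HTML entity 
filter, not Interchange's actual code:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "å" (U+00E5) as the two UTF-8 octets a browser would submit.
my $raw = "\xC3\xA5";

# As a raw byte string, Perl sees two one-octet "characters".
my $raw_length = length $raw;

# After decoding, Perl sees one logical character.
my $chars       = decode('UTF-8', $raw);
my $char_length = length $chars;

# Hypothetical entity filter: escape every non-ASCII character.
sub entity_escape {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord $1)/ge;
    return $text;
}

# On raw bytes the filter emits one bogus entity per octet;
# on the decoded string it emits the single entity for "å".
my $escaped_raw   = entity_escape($raw);     # "&#195;&#165;"
my $escaped_chars = entity_escape($chars);   # "&#229;"

print "$raw_length $char_length $escaped_raw $escaped_chars\n";
```

The same regex gives a different (and wrong) answer on the raw string, 
which is the "funky behavior" in question.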

If you don't need to run regexes or HTML entity filters or whatever 
against your inbound data, then you could probably get by with raw 
encoding.  Otherwise, this will probably bite you.

Assuming the data gets safely into MySQL as well-formed UTF8 (or 
assuming the data already exists in MySQL), pulling the data out is 
another matter.  You'll need to look at the docs for DBD::mysql to see 
what it offers for UTF8 support, or to see if reading the data in from a 
database handle with the client encoding set to UTF8 would do the 
trick.  Basically, UTF8 data coming out of the database will break in 
things like the table editor because of the same regular expression 
problem already mentioned; byte sequences that correspond to a single 
logical character are treated as separate characters and therefore 
semantically mismatch with the intentions of the regular expressions for 
things like HTML entity escaping.  DBD::Pg (for Postgres) provides a 
setting for telling the driver to properly elevate text scalars to UTF8, 
which can address this issue; I'm not familiar with DBD::mysql's 
offerings for this sort of thing.  If you can get the data returned from 
MySQL to be automatically elevated to UTF8 before Interchange touches 
it, then you may pull it off.
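
If the driver cannot do the elevation for you, decoding by hand right 
after the fetch is one possible fallback.  The sketch below assumes the 
column data really is well-formed UTF-8; decode_row is a hypothetical 
helper, and the row is a simulated stand-in for what 
$sth->fetchrow_hashref would return.  (DBD::mysql does document a 
mysql_enable_utf8 connection attribute, and DBD::Pg a pg_enable_utf8 
one; neither has been tested against Interchange here.)

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn every defined value of a fetched row from
# UTF-8 octets into a Perl character string.  NULLs stay undef.
sub decode_row {
    my ($row) = @_;
    my %decoded;
    for my $column (keys %$row) {
        my $value = $row->{$column};
        $decoded{$column} = defined $value ? decode('UTF-8', $value) : undef;
    }
    return \%decoded;
}

# Simulated fetchrow_hashref result: "Malmö" as raw UTF-8 octets.
my $row     = { city => "Malm\xC3\xB6", state => undef };
my $decoded = decode_row($row);

printf "%d octets in, %d characters out\n",
    length $row->{city}, length $decoded->{city};
```

Doing this in one place, as close to the database handle as possible, 
keeps the rest of the code from having to guess what kind of string it 
was handed.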

It's a complicated issue.  Once you have one Perl scalar that is marked 
internally as UTF8, any scalar it combines with will be elevated 
on-demand to UTF8.  So, in theory, having one UTF8 string coming from 
one column in one record of your database could cause the entire output 
buffer for a page to be elevated.  But what about all your template 
pages and such, and their encodings?  File encoding is a somewhat 
mysterious topic, since files aren't typically flagged as being in a 
particular encoding.  You have to know what kinds of encoding you're 
using in every aspect of your application in order for this to work out 
in a controlled fashion.
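
A small sketch of why mixed encodings are dangerous, assuming one value 
arrives decoded (say, from the database) and another is still raw UTF-8 
octets (say, read from a template file without any decoding).  The 
variable names are illustrative only:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $from_db   = decode('UTF-8', "\xC3\xA5");  # character string: "å"
my $from_file = "\xC3\xA4";                   # raw octets for "ä", never decoded

# Concatenation upgrades the raw string, treating each octet as a
# Latin-1 character, so the two octets of "ä" become two characters.
my $page        = $from_db . $from_file;
my $page_length = length $page;               # 3, not 2

# Encoding the page for output now double-encodes the "ä".
my $output = encode('UTF-8', $page);

printf "%d characters, %d octets on the wire\n", $page_length, length $output;
```

The "å" survives the round trip, but the "ä" goes out as four octets of 
garbage instead of two, which is the classic double-encoding symptom.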

And of course getting everything elevated to UTF8 will impose some kind 
of performance penalty.  Probably not anything worth worrying about, I 
would guess, but it's best to be prepared.

As Jon said a little while ago, we're (that is, End Point) preparing a 
change set to improve UTF8 support, and we've been making good 
progress.  Once it's ready and the IC core team has their say, it should 
help considerably.  However, it will remain a complex issue that 
requires a lot of attention to detail and brings a significant 
headache, as it affects all layers of the software stack.

One final note: if you're working with UTF-8, then you will inevitably 
end up feeling a deep sense of loathing for CP1252, because it pops up 
*everywhere*.  If your MySQL data is supposedly latin1, then it's almost 
certainly really CP1252.  :)
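
To see why the distinction matters, here is a sketch that decodes the 
same octet both ways using Perl's Encode module.  The octet 0x93 is 
chosen as an example because it is what a CP1252 "smart quote" pasted 
from a word processor typically looks like:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octet = "\x93";

# True ISO-8859-1 maps 0x93 to an invisible C1 control character.
my $as_latin1 = decode('iso-8859-1', $octet);

# CP1252 maps the same octet to a left double quotation mark.
my $as_cp1252 = decode('cp1252', $octet);

printf "latin1: U+%04X  cp1252: U+%04X\n", ord $as_latin1, ord $as_cp1252;
```

Converting such data to UTF-8 while treating it as real latin1 quietly 
turns the quotes into control characters.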

Thanks.
- Ethan

-- 
Ethan Rowe
End Point Corporation
ethan at endpoint.com