[ic] Non-US keys = UTF-8 issue?

Sat Mar 1 08:01:12 EST 2008

>  Grant,
>
>  Did you sort out the last matter you mentioned, regarding getting UTF-8
>  data into MySQL?

I gave up quickly.  The truth is I don't receive enough data that
needs to be in UTF-8 to spend too much time on it at this point.  It
would of course be nice to have IC working well with UTF-8 but there
are other items with a higher priority for me right now.  I was
hoping to set a catalog variable, I suppose.  It sounds like End
Point's work with UTF-8 will go a long way toward improving it, and I
thank you guys in advance.

- Grant

>  IIRC, Interchange doesn't do much of anything with the incoming data
>  (for a POST or whatever) as far as encoding is concerned; it simply
>  assumes raw encoding on the filehandle between Interchange and the
>  vlink/tlink script.
>
>  I believe this can work, provided that:
>  * the actual web pages themselves, and the forms therein, are properly
>  encoded with UTF-8, marked as such, and thus the browser submits data in
>  UTF-8;
>  * the client encoding on your DBD::mysql connection is set to raw, or
>  whatever MySQL's equivalent encoding name for this is (I cannot
>  remember; I seem to recall that MySQL may treat the latin1 encoding as
>  simple raw encoding, in which case it wouldn't make a difference -- I
>  moved to Postgres when I started dealing with any real UTF-8 data).
>
>  This is all just treating it as raw data, which isn't necessarily
>  ideal.  For one, if the data is coming in as raw byte strings (as
>  outlined above), then regexes will give you funky behavior (for
>  instance, the HTML entity encoding routines will appear to break your
>  data).  This is because in a raw string, each character represents an
>  octet rather than an actual character, but Perl has no way of knowing
>  that.  So, what is in fact a valid high-bit sequence in UTF-8 (for
>  representing any character outside the 7-bit ASCII range) will appear as
>  a series of odd characters in the raw string if you were to simply
>  print the raw string to a non-UTF8 terminal.  In order for regexes to
>  work reliably, the raw data needs to be decoded into a UTF8 scalar,
>  which requires messing with the Perl Encode module.
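
The octet-versus-character point can be sketched roughly as follows.  This
is Python rather than Perl, used only as an analogy: Python's bytes/str
split stands in for raw versus UTF8-flagged scalars, and the sample value is
made up.  In Perl the corresponding step would be the Encode module's
decode().

```python
import re

# Hypothetical form value: "café" as the UTF-8 octets a browser would submit.
raw = b"caf\xc3\xa9"

# Treated as raw data, every octet counts as one "character".
raw_len = len(raw)                      # 5 octets

# Decoded, the two-octet sequence becomes the single character it stands for.
text = raw.decode("utf-8")
text_len = len(text)                    # 4 characters

# A pattern meaning "exactly four characters" matches only the decoded form.
raw_match = re.fullmatch(rb".{4}", raw)
text_match = re.fullmatch(r".{4}", text)

print(raw_len, text_len, raw_match is not None, text_match is not None)
# prints: 5 4 False True
```

The same mismatch is what makes character-oriented regexes misbehave on raw
input: the pattern counts octets where the author meant characters.
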
>
>  If you don't need to run regexes or HTML entity filters or whatever
>  against your inbound data, then you could probably get by with raw
>  encoding.  Otherwise, this will probably bite you.
>
>  Assuming the data gets safely into MySQL as well-formed UTF8 (or
>  assuming the data already exists in MySQL), pulling the data out is
>  another matter.  You'll need to look at the docs for DBD::mysql to see
>  what it offers for UTF8 support, or to see if reading the data in from a
>  database handle with the client encoding set to UTF8 would do the
>  trick.  Basically, UTF8 data coming out of the database will break in
>  things like the table editor because of the same regular expression
>  problem already mentioned; byte sequences that correspond to a single
>  logical character are treated as separate characters and therefore
>  semantically mismatch with the intentions of the regular expressions for
>  things like HTML entity escaping.  DBD::Pg (for Postgres) provides a
>  setting for telling the driver to properly elevate text scalars to UTF8,
>  which can address this issue; I'm not familiar with DBD::mysql's
>  offerings for this sort of thing.  If you can get the data returned from
>  MySQL to be automatically elevated to UTF8 before Interchange touches
>  it, then you may pull it off.
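
As a rough illustration of why entity escaping goes wrong on unelevated
data, here is a small Python sketch.  The encode_entities function is a
deliberately naive stand-in written for this example, not Interchange's
actual filter, and the column value is hypothetical.

```python
def encode_entities(chars):
    """Naive entity filter: replace every code point above 127 with &#N;."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in chars)

raw = b"caf\xc3\xa9"  # hypothetical column value, stored as UTF-8

# Octet-per-character view, which is what a raw read amounts to.
as_octets = raw.decode("latin-1")
broken = encode_entities(as_octets)

# Elevated to real characters before the filter runs.
as_chars = raw.decode("utf-8")
correct = encode_entities(as_chars)

print(broken)   # caf&#195;&#169;
print(correct)  # caf&#233;
```

The first result renders in a browser as two unrelated characters, which is
the kind of breakage described above for the table editor.
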
>
>  It's a complicated issue.  Once you have one Perl scalar that is marked
>  internally as UTF8, any scalar it combines with will be elevated
>  on-demand to UTF8.  So, in theory, having one UTF8 string coming from
>  one column in one record of your database could cause the entire output
>  buffer for a page to be elevated.  But what about all your template
>  pages and such, and their encodings?  File encoding is a somewhat
>  mysterious topic, since files aren't typically flagged as being in a
>  particular encoding.  You have to know what kinds of encoding you're
>  using in every aspect of your application in order for this to work out
>  in a controlled fashion.
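
The on-demand elevation can be simulated in Python, on the assumption
(believed to match Perl's behaviour, but worth checking against perlunicode)
that a non-UTF8 string being elevated is read one octet per character, as if
it were Latin-1.  Python itself refuses to mix bytes and str, so the
elevation is spelled out by hand here; the template text and database value
are invented for the example.

```python
# Hypothetical template fragment read from disk as raw UTF-8 octets.
template_raw = "Prix: 5 \u20ac, ".encode("utf-8")

# Hypothetical database value that arrived as a proper character string.
db_value = "caf\u00e9"

# Simulated on-demand elevation: each octet of the raw string is widened to
# the code point of the same number (i.e. read as Latin-1), then joined.
page = template_raw.decode("latin-1") + db_value

# What the page would have been had the template been decoded knowingly.
intended = template_raw.decode("utf-8") + db_value

print(page == intended)   # False: the euro sign became three stray characters
```

The database value survives intact, while the template text that was
actually UTF-8 on disk is the part that gets mangled, which is why every
source of text needs a known encoding.
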
>
>  And of course getting everything elevated to UTF8 will impose some kind
>  of performance penalty.  Probably not anything worth worrying about, I
>  would guess, but it's best to be prepared.
>
>  As Jon said a little while ago, we're (that is, End Point) preparing a
>  change set to improve UTF8 support, and we've been making good
>  progress.  Once it's ready and the IC core team has their say, it should
>  help considerably.  However, it will remain a complex issue that
>  requires a lot of attention to detail and brings significant headaches, as it
>  affects all layers of the software stack.
>
>  One final note: if you're working with UTF-8, then you will inevitably
>  end up feeling a deep sense of loathing for CP1252, because it pops up
>  *everywhere*.  If your MySQL data is supposedly latin1, then it's almost
>  certainly really CP1252.  :)
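
A quick way to see the latin1/CP1252 difference, sketched in Python with a
made-up column value: the octets 0x93 and 0x94 are invisible control codes
in true ISO-8859-1, but curly quotes in CP1252.

```python
# Hypothetical "latin1" column value containing curly quotes pasted from a
# word processor: octets 0x93 and 0x94.
raw = b"\x93quoted\x94"

as_latin1 = raw.decode("latin-1")   # ISO-8859-1: 0x93/0x94 are C1 controls
as_cp1252 = raw.decode("cp1252")    # Windows-1252: curly double quotes

print([hex(ord(c)) for c in (as_latin1[0], as_latin1[-1])])
# prints: ['0x93', '0x94']
print([hex(ord(c)) for c in (as_cp1252[0], as_cp1252[-1])])
# prints: ['0x201c', '0x201d']
```

If octets in the 0x80-0x9F range show up in supposedly latin1 data, that is
a strong hint the data is really CP1252.
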
>
>  Thanks.
>  - Ethan