[ic] Non-US keys = UTF-8 issue?
Grant
emailgrant at gmail.com
Sat Mar 1 08:01:12 EST 2008
> Grant,
>
> Did you sort out the last matter you mentioned, regarding getting UTF-8
> data into MySQL?
I gave up quickly. The truth is I don't receive enough data that
needs to be in UTF-8 to justify spending much time on it at this point.
It would of course be nice to have IC working well with UTF-8, but
there are other items with a higher priority for me right now. I was
hoping to set a catalog variable I suppose. It sounds like End
Point's work with UTF-8 will go a long way toward improving it, and I
thank you guys in advance.
- Grant
> IIRC, Interchange doesn't do much of anything with the incoming data
> (for a POST or whatever) as far as encoding is concerned; it simply
> assumes raw encoding on the filehandle between Interchange and the
> vlink/tlink script.
>
> I believe this can work, provided that:
> * the actual web pages themselves, and the forms therein, are properly
> encoded with UTF-8, marked as such, and thus the browser submits data in
> UTF-8;
> * the client encoding on your DBD::mysql connection is set to raw, or
> whatever MySQL's equivalent encoding name for this is (I cannot
> remember; I seem to recall that MySQL may treat the latin1 encoding as
> simple raw encoding, in which case it wouldn't make a difference -- I
> moved to Postgres when I started dealing with any real UTF-8 data).
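The "raw" semantics described above are easy to see from plain Perl: a raw
UTF-8 byte string reports its length in octets, while the decoded string
reports characters. A minimal sketch using only the core Encode module:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "e-acute" encoded as UTF-8 is two octets: 0xC3 0xA9.
my $raw = "\xC3\xA9";

# To Perl this is just two bytes...
print length($raw), "\n";    # 2

# ...until it is decoded into a character string.
my $chars = decode('UTF-8', $raw);
print length($chars), "\n";  # 1
```

As long as nothing inspects the string character-by-character, those two
octets pass through Interchange and into the database untouched, which is
why the raw approach can work at all.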
>
> This is all just treating it as raw data, which isn't necessarily
> ideal. For one, if the data is coming in as raw byte strings (as
> outlined above), then regexes will give you funky behavior (for
> instance, the HTML entity encoding routines will appear to break your
> data). This is because in a raw string, each character represents an
> octet rather than an actual character, but Perl has no way of knowing
> that. So, what is in fact a valid high-bit sequence in UTF-8 (for
> representing any character outside the 7-bit ASCII range) will appear as
> a series of odd characters in the raw string if you were to simply
> print the raw string to a non-UTF8 terminal. In order for regexes to
> work reliably, the raw data needs to be re-encoded as a UTF8 scalar,
> which requires messing with the Perl Encode module.
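The re-encoding step is a one-liner with the core Encode module. A sketch of
the failure mode and the fix; the entity-escaping regex here is a simplified
stand-in for what Interchange's filters do, not Interchange's actual code:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw POST data as it arrives on the filehandle: UTF-8 octets.
my $raw = "caf\xC3\xA9";

# Run over the raw bytes, a high-bit regex would see 0xC3 and 0xA9 as two
# separate "characters" and emit two bogus entities. Decoding first gives
# the regex one logical character to work with.
my $text = decode('UTF-8', $raw);

(my $escaped = $text) =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord($1))/ge;
print $escaped, "\n";    # caf&#233;

# Re-encode before writing the result back out as bytes.
my $bytes_out = encode('UTF-8', $text);
```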
>
> If you don't need to run regexes or HTML entity filters or whatever
> against your inbound data, then you could probably get by with raw
> encoding. Otherwise, this will probably bite you.
>
> Assuming the data gets safely into MySQL as well-formed UTF8 (or
> assuming the data already exists in MySQL), pulling the data out is
> another matter. You'll need to look at the docs for DBD::mysql to see
> what it offers for UTF8 support, or to see if reading the data in from a
> database handle with the client encoding set to UTF8 would do the
> trick. Basically, UTF8 data coming out of the database will break in
> things like the table editor because of the same regular expression
> problem already mentioned; byte sequences that correspond to a single
> logical character are treated as separate characters and therefore
> semantically mismatch with the intentions of the regular expressions for
> things like HTML entity escaping. DBD::Pg (for Postgres) provides a
> setting for telling the driver to properly elevate text scalars to UTF8,
> which can address this issue; I'm not familiar with DBD::mysql's
> offerings for this sort of thing. If you can get the data returned from
> MySQL to be automatically elevated to UTF8 before Interchange touches
> it, then you may pull it off.
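If the driver does not offer such a setting (or you cannot rely on it),
elevating fetched columns by hand with Encode accomplishes the same thing.
A sketch, where the $row hashref stands in for a fetchrow_hashref result
rather than a real database fetch:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Simulated fetch: without a driver-level flag (pg_enable_utf8 for DBD::Pg,
# or mysql_enable_utf8 in sufficiently recent DBD::mysql), text columns
# come back as raw octets.
my $row = { name => "Fran\xC3\xA7ois" };

# Elevate each text column to a UTF8 character string before Interchange
# (or an entity-escaping regex) touches it.
$row->{name} = decode('UTF-8', $row->{name});

print length($row->{name}), "\n";    # 8 characters, not 9 octets
```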
>
> It's a complicated issue. Once you have one Perl scalar that is marked
> internally as UTF8, any scalar it combines with will be elevated
> on-demand to UTF8. So, in theory, having one UTF8 string coming from
> one column in one record of your database could cause the entire output
> buffer for a page to be elevated. But what about all your template
> pages and such, and their encodings? File encoding is a somewhat
> mysterious topic, since files aren't typically flagged as being in a
> particular encoding. You have to know what kinds of encoding you're
> using in every aspect of your application in order for this to work out
> in a controlled fashion.
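The on-demand elevation is easy to observe with utf8::is_utf8, Perl's
built-in check for the internal UTF8 flag. A minimal sketch of one flagged
database value "contaminating" an unflagged template string:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $template = "Price: ";                           # plain byte string
my $from_db  = decode('UTF-8', "\xE2\x82\xAC42");   # euro sign, UTF8-flagged

print utf8::is_utf8($template) ? 1 : 0, "\n";   # 0
print utf8::is_utf8($from_db)  ? 1 : 0, "\n";   # 1

# Concatenating an unflagged and a flagged string yields a flagged result,
# which is how one column can elevate a whole output buffer.
my $page = $template . $from_db;
print utf8::is_utf8($page) ? 1 : 0, "\n";       # 1
```

This is harmless for pure-ASCII templates, but any template containing
high-bit bytes will be upgraded as if it were latin1, which is where the
mystery characters come from.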
>
> And of course getting everything elevated to UTF8 will impose some kind
> of performance penalty. Probably not anything worth worrying about, I
> would guess, but it's best to be prepared.
>
> As Jon said a little while ago, we're (that is, End Point) preparing a
> change set to improve UTF8 support, and we've been making good
> progress. Once it's ready and the IC core team has their say, it should
> help considerably. However, it will remain a complex issue that
> requires a lot of attention to detail, and involves a significant
> headache, as it
> affects all layers of the software stack.
>
> One final note: if you're working with UTF-8, then you will inevitably
> end up feeling a deep sense of loathing for CP1252, because it pops up
> *everywhere*. If your MySQL data is supposedly latin1, then it's almost
> certainly really CP1252. :)
>
> Thanks.
> - Ethan
More information about the interchange-users mailing list