[ic] ic-utf8 readfile/writefile patch

Mike Heins mike at perusion.com
Mon Mar 16 06:48:09 UTC 2009

Quoting David Christensen (david at endpoint.com):
> On Mar 15, 2009, at 11:18 PM, Mike Heins wrote:
> > Quoting David Christensen (david at endpoint.com):
> >> Folks,
> >>
> >> I've added a patch to the ic-utf8 tree to support encoding/fallback
> >> strategy in Vend::File::readfile and writefile.  This is intended to
> >> be completely backwards-compatible with both legacy encodings and the
> >> current MV_UTF8 scheme while offering the following benefits:
> >>
> >>  - Explicit override of the encoding of any specific file.  This
> >> defaults to nothing (aka raw) when MV_UTF8 is not set, and utf-8 when
> >> MV_UTF8 is set.
> >>  - Sensible default fallback to provide maximum information in the
> >> case that invalid encoding/decoding sequences are encountered.
> >> (Fallback strategy is how we deal with invalid/incomplete characters.)
> >>  - Think future modifications to [include] to provide access to
> >> encoding and fallback parameters:  [include file="foo/bar/baz"
> >> encoding="cp1252"]
> >>
> >> I'd appreciate testing of this patch; in particular, this should help
> >> with Racke's issue encountered with legacy encodings on the index
> >> page with MV_UTF8 set.
> >
> > Has anyone thought of performance? Can this be disabled for people who
> > don't want to spend processor power on UTF8?
>
> This should have no discernible performance impact for the legacy
> mode; it's just a few additional if checks.

"Legacy mode"? Do you mean non-UTF8? That won't be legacy for everyone
quite yet. In Perl 5, UTF8 is a quagmire on the order of threads.
Neither is production-ready after years of effort. Anyone who is not
forced to use it shouldn't have to.

If it is part of the main line code, there will be the potential for
breakage at any time.

Also, tracing this down, it *isn't* just a few ifs. It does a
$class->routine() call which calls Encode, which makes multiple
subroutine calls at a minimum.

> I know you'd had some performance concerns before; were there some
> test cases you'd been able to isolate where you were finding
> specific issues?

We of course isolated the post and file upload performance problem. Has
that problem been fixed? Not to my knowledge -- we are calling
Vend::CharSet for every value. That routine slows uploads by an order of
magnitude, needlessly unless you are using UTF8.

And why are we using Vend::CharSet->routine() instead of
Vend::CharSet::routine()? Including the internal calls in the module? We
don't think there is going to be any subclassing going on, do we? That
method of call is twice as expensive.  It is syntactic sugar we don't
need. There is no created object, and the code shows a near-zero potential
for Vend::CharSet being a base class.
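
The cost difference between the two calling styles can be measured rather
than argued. The following is a minimal sketch using the core Benchmark
module. The package Hypothetical::CharSet and its two routines are made-up
stand-ins, not the real Vend::CharSet code, and the actual ratio will vary
with the Perl version and with how much work the routine itself does:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

{
    # Hypothetical stand-in for a class-style utility module.
    package Hypothetical::CharSet;

    # Called as Hypothetical::CharSet->passthrough_method($value);
    # the class name arrives as the first argument.
    sub passthrough_method {
        my ($class, $value) = @_;
        return $value;
    }

    # Called as Hypothetical::CharSet::passthrough_function($value);
    # no method resolution, no extra argument.
    sub passthrough_function {
        my ($value) = @_;
        return $value;
    }
}

my $sample = 'some form value';

# Run each style a fixed number of times and print a comparison table.
cmpthese(500_000, {
    method   => sub { Hypothetical::CharSet->passthrough_method($sample) },
    function => sub { Hypothetical::CharSet::passthrough_function($sample) },
});
```

Both routines return the same value, so any difference in the table comes
from the call mechanism alone.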

> I'd be glad to work to make any of these changes as low-impact as
> possible.

My point is that they should be near-zero impact if UTF8 is not a part
of the system. Not low impact, no impact. Right now they impact things
by breaking them -- a pretty big impact -- as well as slowing them down
greatly. Not good. In fact, that would normally make it a candidate for
ripping out by the roots until it is fixed.
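
One way to get close to "no impact" is to make the non-UTF8 path a single
boolean test that returns before any Encode code is loaded or called. This
is only a sketch of the idea: maybe_decode and its $enabled argument are
hypothetical names standing in for whatever check the real code would use
(for example, a setting such as MV_UTF8), not existing Interchange routines:

```perl
use strict;
use warnings;

# Hypothetical helper: decode only when UTF-8 handling is switched on.
sub maybe_decode {
    my ($octets, $enabled) = @_;

    # Non-UTF8 systems pay for one boolean test and nothing else.
    return $octets unless $enabled;

    # Encode is loaded lazily, the first time it is actually needed.
    require Encode;
    return Encode::decode('UTF-8', $octets);
}

# "cafe" with an e-acute, as raw UTF-8 octets (5 bytes, 4 characters).
my $raw = "caf\xC3\xA9";

printf "disabled: length %d\n", length(maybe_decode($raw, 0));
printf "enabled:  length %d\n", length(maybe_decode($raw, 1));
```

With the flag off the octets come back untouched; with it on they are
decoded into characters.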

Mike Heins
Perusion -- Expert Interchange Consulting    http://www.perusion.com/
phone +1.765.328.4479  <mike at perusion.com>

Experience is what allows you to recognize a mistake the second time you
make it. -- unknown
