[ic] Call for testers
david at endpoint.com
Mon Jun 22 03:42:55 UTC 2009
On Jun 20, 2009, at 6:58 PM, Gert van der Spoel wrote:
> I have been looking at the GDBM problems and have been able to solve
> them in my particular case
> ... I only am not able to rationalize the changes, which is annoying
Judging from your patch to GDBM.pm, you just removed the utf8 filters
on the GDBM handle, so the data you read in from the GDBM file is just
the raw value for the key. Was the issue you were trying to solve one
of corrupt values being read in, or was it missing values? Without
upgrading the contents of the GDBM files, the utf8 filters would
(presumably) die when trying to decode data from an 8-bit encoding
such as latin-1, so there would never be a successful read for keys
with high-bit values unless the underlying encoding already was utf8.
If values were missing, that would presumably be due to high-bit keys
being looked up by their encoded form, which is no longer the same as
the raw octets.
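To make both failure modes concrete, here is a rough sketch. It is in
Python purely for illustration (not Interchange's Perl), and a plain
dict stands in for the GDBM file; the key and value are made up.

```python
# Illustration only: a dict stands in for a GDBM file written by a
# legacy catalog that stored latin-1 octets.
key = "caf\u00e9"                      # hypothetical key with a high-bit character

latin1_key = key.encode("latin-1")     # b'caf\xe9'     (what the old file holds)
utf8_key = key.encode("utf-8")         # b'caf\xc3\xa9' (what a utf8 filter looks up)

store = {latin1_key: b"legacy value"}

found_raw = store.get(latin1_key)      # lookup by the raw octets succeeds
found_encoded = store.get(utf8_key)    # lookup by the utf8-encoded form misses

# Strict UTF-8 decoding of the legacy octets fails outright.
try:
    latin1_key.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(found_raw, found_encoded, decode_failed)
```

The miss on the encoded key corresponds to the "missing values" case,
and the decode error to the filter dying on read.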
The original testing may have been done on ASCII-only data. ASCII is a
trivial subset of UTF8, which may be why we didn't encounter this
issue before: decoding utf8 for an all-ASCII byte range is a no-op,
and hence would not result in dying.
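A quick way to convince yourself of the ASCII point, again as a Python
sketch rather than anything taken from the Interchange code:

```python
# Every 7-bit byte decodes to the same text under UTF-8 and latin-1,
# so ASCII-only test data can never expose the encoding mismatch.
ascii_bytes = bytes(range(128))

as_utf8 = ascii_bytes.decode("utf-8")
as_latin1 = ascii_bytes.decode("latin-1")

same_text = as_utf8 == as_latin1                       # identical decoded text
round_trip = as_utf8.encode("utf-8") == ascii_bytes    # octets unchanged

print(same_text, round_trip)
```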
It seems to me that we have a few choices:
1) run some sort of catalog upgrade program to update any GDBM file
from its original encoding to UTF8-encoded. This has the downside
of being a mutating change to the catalog which is not nicely
reversible.
2) document the fact that these files need to be updated. This relies
on people being knowledgeable enough to handle this on their own.
3) define a variable with a default fallback legacy encoding; if
UTF8 decoding fails, we attempt to decode with the legacy encoding
instead. This would enable the transparent handling of such
differences in encoding, although there could certainly be some
possibility of errors creeping in.
4) have some sort of metadata table which contains a list of tables
and their known encoding status. This would enable us to decode the
specific encoding on the fly, and would also enable us to perform a
bulk encoding change to UTF8 if the user desired at some point.
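As a rough sketch of what option 3 could look like (Python for
illustration only; LEGACY_ENCODING and decode_with_fallback are
hypothetical names, not existing Interchange settings or functions):

```python
LEGACY_ENCODING = "latin-1"  # hypothetical configurable default


def decode_with_fallback(raw: bytes, legacy: str = LEGACY_ENCODING) -> str:
    """Try strict UTF-8 first; fall back to the legacy encoding on failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(legacy)


print(decode_with_fallback(b"caf\xc3\xa9"))   # already UTF-8
print(decode_with_fallback(b"caf\xe9"))       # legacy latin-1 octets

# Where errors can creep in: legacy octets that happen to be valid UTF-8
# are silently read as UTF-8 instead of the legacy encoding.
ambiguous = "\u00c3\u00a9".encode("latin-1")  # b'\xc3\xa9'
print(decode_with_fallback(ambiguous))
```

The last call shows the risk mentioned in option 3: the two latin-1
characters are indistinguishable from a single UTF-8 encoded one, so
the fallback never triggers.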
I'm leaning toward some combination of 3 and 4, but if anyone else has
ideas about how to manage this, I'd be glad to hear them.
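For what it's worth, the metadata idea in option 4 might be sketched
like this. It is Python for illustration only; the table names, the
table_encodings mapping, and both helper functions are hypothetical,
and a dict again stands in for a GDBM file.

```python
# Hypothetical metadata: table name -> encoding its octets are stored in.
table_encodings = {"products": "latin-1", "userdb": "utf-8"}


def decode_value(table: str, raw: bytes) -> str:
    """Decode on the fly using the encoding recorded for the table."""
    return raw.decode(table_encodings.get(table, "utf-8"))


def convert_table_to_utf8(table: str, rows: dict) -> dict:
    """Bulk re-encode keys and values to UTF-8 and update the metadata."""
    source = table_encodings.get(table, "utf-8")
    converted = {
        k.decode(source).encode("utf-8"): v.decode(source).encode("utf-8")
        for k, v in rows.items()
    }
    table_encodings[table] = "utf-8"
    return converted


products = {b"caf\xe9": b"cr\xe8me"}              # legacy latin-1 octets
before = decode_value("products", b"cr\xe8me")    # decoded via the metadata
products = convert_table_to_utf8("products", products)
after = decode_value("products", products["caf\u00e9".encode("utf-8")])
print(before, after, table_encodings["products"])
```

Because the encoding is recorded rather than guessed, the ambiguity
problem of a pure fallback goes away for tables listed in the metadata;
the fallback would then only matter for tables with no entry.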
End Point Corporation
david at endpoint.com