[ic] Call for testers

Mon Jun 22 03:42:55 UTC 2009

On Jun 20, 2009, at 6:58 PM, Gert van der Spoel wrote:

> I have been looking at the GDBM problems and have been able to solve
> them in my particular case
> ... I only am not able to rationalize the changes, which is annoying
> hehe..

From your patch to GDBM.pm, you just removed the utf8 filters on the
GDBM handle, so the data you read in from the GDBM file is just the  
raw value for the key.  Was the issue you were trying to solve one of  
corrupt values being read in, or was it missing values?  Without  
upgrading the contents of the GDBM files, the utf8 filters would  
(presumably) die when trying to decode data from an 8-bit encoding  
such as latin-1, so there would never be a successful read for keys  
with high-bit values unless the underlying encoding was already utf8.
If values were missing, that would presumably be due to high-bit keys
being looked up by their encoded form, which is no longer the same as  
the raw octets.
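
To make the two failure modes concrete, here is a rough illustration. It is written in Python rather than Perl and is not the actual GDBM.pm filter code; the function name and sample data are made up for the example. It assumes the read filter decodes strictly, which matches the "would (presumably) die" guess above.

```python
# Illustration only: Python stands in for a strict utf8 read filter.
# All names and sample data here are hypothetical.

# "cafe" with an e-acute, as an old latin-1 catalog would have stored it.
raw_value = "caf\u00e9".encode("latin-1")      # b'caf\xe9'

def strict_utf8_read(octets: bytes) -> str:
    """Decode the way a strict utf8 filter would; raises on invalid input."""
    return octets.decode("utf-8")

# Failure mode 1: the read dies, because a lone 0xE9 is not valid UTF-8.
try:
    strict_utf8_read(raw_value)
    read_ok = True
except UnicodeDecodeError:
    read_ok = False

# Failure mode 2: the value looks missing, because the key written by the
# old code (latin-1 octets) differs from the key the new code looks up
# (UTF-8 octets), even though both spell the same text.
stored_key = "caf\u00e9".encode("latin-1")     # b'caf\xe9'
lookup_key = "caf\u00e9".encode("utf-8")       # b'caf\xc3\xa9'
keys_match = stored_key == lookup_key
```

Here `read_ok` and `keys_match` both end up false, which corresponds to the "corrupt/failed read" and "missing value" cases respectively.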

The original testing may have been done on ASCII-only data.  ASCII is
a trivial subset of UTF8, which may be why we didn't encounter this
issue before: decoding an all-ASCII byte range as utf8 is a no-op,
and hence would not die.
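
A small check of that claim, again as a Python sketch with a hypothetical helper name and made-up sample data: for ASCII-only octets, decoding as utf8 and decoding as latin-1 give the same string, so a test catalog containing only ASCII could not have exposed the problem.

```python
def decodes_identically(octets: bytes) -> bool:
    """True when strict UTF-8 and latin-1 decoding agree on these octets."""
    try:
        return octets.decode("utf-8") == octets.decode("latin-1")
    except UnicodeDecodeError:
        # Strict UTF-8 decoding failed, so the two cannot agree.
        return False

ascii_only = decodes_identically(b"sku-1001")     # pure ASCII sample
with_high_bit = decodes_identically(b"caf\xe9")   # latin-1 e-acute sample
```

Only the second sample, which contains a high-bit octet, shows a difference.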

It seems to me that we have a few choices:

1) run some sort of catalog upgrade program to convert any GDBM files
from their original encoding to UTF8.  This has the downside of being
an in-place change to the catalog which is not easily
downgradeable.

2) document the fact that these files need to be updated.  This relies  
on people being knowledgeable enough to handle this on their own.

3) define a variable holding a default fallback legacy encoding; if
UTF8 decoding fails, we attempt to decode with the legacy encoding
instead.  This would handle such differences in encoding
transparently, although there could certainly be some
possibility of errors creeping in.

4) have some sort of metadata table which contains a list of tables  
and their known encoding status.  This would enable us to decode the  
specific encoding on the fly, and would also enable us to perform a  
bulk encoding change to UTF8 if the user desired at some point.
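
Options 3 and 4 could be combined roughly as sketched below. This is Python rather than Perl and only illustrates the decision logic, not a proposed implementation; `DEFAULT_LEGACY_ENCODING`, `TABLE_ENCODINGS`, `decode_value`, and the table names are all hypothetical, not existing Interchange settings.

```python
from typing import Dict, Optional

# Option 3: hypothetical configurable fallback for data that is not valid UTF-8.
DEFAULT_LEGACY_ENCODING = "latin-1"

# Option 4: hypothetical metadata mapping table name -> known encoding.
TABLE_ENCODINGS: Dict[str, str] = {
    "products": "utf-8",
    "legacy_users": "latin-1",
}

def decode_value(octets: bytes, table: Optional[str] = None) -> str:
    """Decode octets read from a table.

    If the table's encoding is recorded in the metadata, trust it.
    Otherwise try strict UTF-8 first and fall back to the legacy encoding.
    """
    known = TABLE_ENCODINGS.get(table) if table is not None else None
    if known is not None:
        return octets.decode(known)
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return octets.decode(DEFAULT_LEGACY_ENCODING)
```

The fallback path is where errors can creep in: legacy octets that happen to form a valid UTF-8 sequence are silently decoded as UTF-8 and never reach the fallback. Recording the encoding per table sidesteps that guess for any table listed in the metadata.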

I'm leaning toward some combination of 3 and 4, but if anyone else has
ideas about how to manage this, I'd be glad to hear them.

Regards,

David
--
David Christensen
End Point Corporation
david at endpoint.com