[baseten-users] Unicode strings in BaseTen

Tuukka Norri tuukka.norri at karppinen.fi
Mon May 18 17:54:00 EEST 2009


Erik Aderstedt wrote on 18.5.2009 at 15:37:
> I have to confess to not knowing very much about Unicode, but after  
> some Googling I found that I could solve my problem with the  
> different hash values by using -[NSString  
> precomposedStringWithCanonicalMapping] on the value that was stored  
> using  -[BXDatabaseContext  
> createObjectForEntity:withFieldValues:error:]. The hash value on the  
> resulting string is what is expected.
>
> The above workaround does not directly relate to BaseTen, but I  
> thought I'd include it here for completeness.

Hi!

For some characters, Unicode has two representations. As you wrote, Ö
may be either “latin capital letter o with diaeresis” or “latin
capital letter o” followed by “combining diaeresis”. A string may
mix both representations or be normalized to one of several
normalization forms. When storing string values, BaseTen uses
normalization form D (NFD), which means that precomposed characters
are replaced with their decomposed forms. I think we have done this
at least since version 1.5. When fetching, no normalization is done,
but possibly it should be.
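
As a rough illustration of the two representations (in Python rather
than Objective-C, and not BaseTen code), the standard unicodedata
module can convert between the precomposed and decomposed forms. The
variable names are only for this example:

```python
import unicodedata

# Precomposed: LATIN CAPITAL LETTER O WITH DIAERESIS (one code point).
composed = "\u00D6"
# Decomposed: LATIN CAPITAL LETTER O followed by COMBINING DIAERESIS.
decomposed = "O\u0308"

# The two strings look the same but differ code point by code point.
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 1 2

# Normalization form D replaces the precomposed character
# with its decomposed form.
nfd = unicodedata.normalize("NFD", composed)
print(nfd == decomposed)                 # True

# Normalization form C goes the other way.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)                   # True
```

NSString's -decomposedStringWithCanonicalMapping and
-precomposedStringWithCanonicalMapping correspond to forms D and C.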

-createObjectForEntity:withFieldValues:error: causes BXDatabaseObject
to return the string with precomposed characters, because the object
caches the value from the field values dictionary as it was given.
Right now I'm not sure how we should handle this case.
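
The mismatch can be modelled with a toy example. This is only a
sketch in Python under stated assumptions: ToyContext and its methods
are hypothetical and are not BaseTen's API or implementation. It
assumes only what is described above, namely that the stored value is
normalized to form D while the cached value is kept as given:

```python
import unicodedata


class ToyContext:
    """Hypothetical stand-in for a database context; not BaseTen's API."""

    def __init__(self):
        self.database = {}  # values as they would be stored
        self.cache = {}     # values kept from the field values dictionary

    def create_object(self, key, value):
        # Stored value is normalized to form D (assumption from the post).
        self.database[key] = unicodedata.normalize("NFD", value)
        # Cached value is kept exactly as the caller passed it.
        self.cache[key] = value

    def cached_value(self, key):
        return self.cache[key]

    def fetched_value(self, key):
        return self.database[key]


ctx = ToyContext()
ctx.create_object("name", "\u00D6rebro")  # precomposed Ö

cached = ctx.cached_value("name")
fetched = ctx.fetched_value("name")

# The cached and the stored strings differ in representation.
print(cached == fetched)                                # False
# Normalizing the cached value to form D makes them match.
print(unicodedata.normalize("NFD", cached) == fetched)  # True
```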

That NSString returns different hash values is probably intended
behaviour. If one string has been normalized and another hasn't, they
probably shouldn't be considered equal in all situations.
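
The practical effect on hashed collections can be sketched as follows.
This is Python, not NSString, so the hashing details differ, but the
idea is the same: a key in one form does not find an entry stored
under the other form unless both are normalized first. The nfd helper
and the dictionary contents are made up for the example:

```python
import unicodedata


def nfd(s):
    """Return s in normalization form D."""
    return unicodedata.normalize("NFD", s)


# An entry stored under the decomposed form of Ö.
stored = {nfd("\u00D6"): "row 1"}

# Looking it up with the precomposed form misses.
lookup_key = "\u00D6"
print(lookup_key in stored)       # False

# Normalizing the key the same way before the lookup finds it.
print(nfd(lookup_key) in stored)  # True
```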
-- 
Best regards,
Tuukka Norri
MK&C


