[baseten-users] Unicode strings in BaseTen
Tuukka Norri
tuukka.norri at karppinen.fi
Mon May 18 17:54:00 EEST 2009
Erik Aderstedt wrote on 18.5.2009 at 15:37:
> I have to confess to not knowing very much about Unicode, but after
> some Googling I found that I could solve my problem with the
> different hash values by using -[NSString
> precomposedStringWithCanonicalMapping] on the value that was stored
> using -[BXDatabaseContext
> createObjectForEntity:withFieldValues:error:]. The hash value on the
> resulting string is what is expected.
>
> The above workaround does not directly relate to BaseTen, but I
> thought I'd include it here for completeness.
Hi!
For some characters, Unicode has two representations. As you wrote, Ö
may be either “latin capital letter o with diaeresis” (U+00D6) or
“latin capital letter o” (U+004F) followed by “combining diaeresis”
(U+0308). A string may mix both representations or be normalized to
one of several normalization forms. When storing string values,
BaseTen uses normalization form D (NFD), which means that composed
characters are replaced with their decomposed forms. I think we have
done this at least since version 1.5. When fetching, no normalization
is done, though possibly it should be.
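
To make the two representations concrete, here is a small sketch in
Python rather than Objective-C, using the standard unicodedata module.
It only illustrates the Unicode behaviour described above; the helper
names to_nfd and to_nfc are made up for this example and are not part
of BaseTen or Foundation.

```python
import unicodedata

# The two canonically equivalent spellings of "Ö".
composed = "\u00d6"     # LATIN CAPITAL LETTER O WITH DIAERESIS
decomposed = "O\u0308"  # LATIN CAPITAL LETTER O + COMBINING DIAERESIS


def to_nfd(text):
    """Normalization form D: replace composed characters with decomposed ones."""
    return unicodedata.normalize("NFD", text)


def to_nfc(text):
    """Normalization form C: recombine into precomposed characters where possible."""
    return unicodedata.normalize("NFC", text)


# The raw strings differ code point by code point...
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# ...but each can be converted into the other by normalizing.
print(to_nfd(composed) == decomposed)  # True
print(to_nfc(decomposed) == composed)  # True
```

Form C here plays roughly the role that
-precomposedStringWithCanonicalMapping played in the workaround quoted
above.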
-createObjectForEntity:withFieldValues:error: causes BXDatabaseObject
to return the string with precomposed characters, because it caches
the value from the field values dictionary. Right now I'm not sure how
we should handle this case.
NSString returning different hash values is probably intended
behaviour. If one string has been normalized and another hasn't, they
probably shouldn't be considered equal in all situations.
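
The practical consequence for hash-based lookups can be sketched as
follows, again in Python and only as an analogy: NSString and
NSDictionary are not involved, and normalized_key is a hypothetical
helper. A string that went through form D normalization no longer
matches its precomposed spelling as a dictionary key unless both sides
are normalized to the same form first.

```python
import unicodedata

precomposed = "\u00d6"  # "Ö" as a single code point
stored = unicodedata.normalize("NFD", precomposed)  # what an NFD-normalizing store would keep


def normalized_key(text):
    """Hypothetical helper: pick one form (NFC here) for every key."""
    return unicodedata.normalize("NFC", text)


# Keyed by the raw stored value, a lookup with the precomposed string misses.
raw_index = {stored: "row 1"}
print(precomposed in raw_index)  # False

# Normalizing both the key and the probe to the same form makes it match.
safe_index = {normalized_key(stored): "row 1"}
print(normalized_key(precomposed) in safe_index)  # True
```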
--
Best regards,
Tuukka Norri
MK&C