Swift - What does it mean that two strings have the same linguistic meaning?
In the Swift documentation on comparing strings, I found the following:
Two String values (or two Character values) are considered equal if their extended grapheme clusters are canonically equivalent. Extended grapheme clusters are canonically equivalent if they have the same linguistic meaning and appearance, even if they are composed from different Unicode scalars behind the scenes.
The documentation then proceeds with the following example, which shows two strings that are "canonically equivalent":
For example, LATIN SMALL LETTER E WITH ACUTE (U+00E9) is canonically equivalent to LATIN SMALL LETTER E (U+0065) followed by COMBINING ACUTE ACCENT (U+0301). Both of these extended grapheme clusters are valid ways to represent the character é, and so they are considered to be canonically equivalent:
OK. Somehow e and é are the same and have the same linguistic meaning. Sure, I'll give them that. I took a Spanish class at some point, and the professor wasn't strict about which of the two forms of e we used, so I'm guessing that is what they are referring to. Fair enough.
The documentation then goes further and shows two strings that are not canonically equivalent:
Conversely, LATIN CAPITAL LETTER A (U+0041, or "A"), as used in English, is not equivalent to CYRILLIC CAPITAL LETTER A (U+0410, or "А"), as used in Russian. The characters are visually similar, but do not have the same linguistic meaning:
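Now here is where the alarm bells go off and I decide to ask this question. It seems that appearance has nothing to do with it, because the two strings look exactly the same, and the documentation admits as much. So it seems that what the String class is really looking at is linguistic meaning?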
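This is why I ask what it means for strings to have the same or different linguistic meaning. e is the only form of e that I know to be used in English, and I have only seen é used in languages like French or Spanish, yet the two are considered the same. So why does the fact that А is used in Russian and A is used in English cause the String class to treat them as not equivalent?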
I hope I was able to walk you through my thought process. My question is: what does it mean for two strings to have the same linguistic meaning (in code, if possible)?
You said:
Somehow e and é are the same and have the same linguistic meaning.
No. You have misread the documentation. Here is the passage again:
LATIN SMALL LETTER E WITH ACUTE (U+00E9) is canonically equivalent to LATIN SMALL LETTER E (U+0065) followed by COMBINING ACUTE ACCENT (U+0301).
Here's U+00E9: é
Here's U+0065: e
Here's U+0301: ´
Here's U+0065 followed by U+0301: é
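So U+00E9 (é) looks the same as, and means the same as, U+0065 U+0301 (é). Therefore they must be treated as equal. The documentation makes no such claim about a plain e (U+0065 on its own) and é.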
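Since you asked for code, here is a minimal sketch of that comparison. The constant names are only illustrative; the behavior relies on Swift's String equality using canonical equivalence, as the quoted documentation describes.

```swift
// Two ways to spell é: one precomposed scalar, or e plus a combining accent.
let precomposed = "\u{E9}"        // LATIN SMALL LETTER E WITH ACUTE
let decomposed = "\u{65}\u{301}"  // LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
let plainE = "\u{65}"             // LATIN SMALL LETTER E, no accent

// Canonically equivalent, so String treats them as equal.
print(precomposed == decomposed)  // true

// Each is a single Character (one extended grapheme cluster)...
print(precomposed.count, decomposed.count)  // 1 1

// ...but they are built from a different number of Unicode scalars.
print(precomposed.unicodeScalars.count, decomposed.unicodeScalars.count)  // 1 2

// A plain e is a different character and does not compare equal to é.
print(plainE == precomposed)  // false
```

The last line is the case your question mixed up: e and é are never equal, only the two encodings of é are.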
So why is Cyrillic А different from Latin A? UTN #26 gives several reasons. Here are some:
“Traditional graphology has treated them as distinct scripts, …”
“Literate users of the Latin, Greek, and Cyrillic alphabets do not have cultural conventions of treating each other's alphabets and letters as part of their own writing systems.”
“Even more significantly, from the point of view of the problem of character encoding for digital textual representation in information technology, the preexisting identification of Latin, Greek, and Cyrillic as distinct scripts was carried over into character encoding, from the earliest instances of such encodings.”
“[A] unified encoding of Latin, Greek, and Cyrillic would make casing operations an unholy mess, …”
Read the tech note for full details.
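To see the distinction in code, here is a rough sketch. The constant names are illustrative, and the B/В pair is my own choice of example for the casing concern, not one taken from the tech note.

```swift
// Look-alike capitals from two different scripts.
let latinA = "\u{41}"      // LATIN CAPITAL LETTER A
let cyrillicA = "\u{410}"  // CYRILLIC CAPITAL LETTER A

// Different scalars with no canonical equivalence between them.
print(latinA == cyrillicA)  // false

// Casing shows why the letters are kept apart: these two capitals
// look alike, but their lowercase forms do not.
let latinB = "\u{42}"      // LATIN CAPITAL LETTER B
let cyrillicV = "\u{412}"  // CYRILLIC CAPITAL LETTER VE, looks like B

print(latinB.lowercased())     // b (U+0062)
print(cyrillicV.lowercased())  // в (U+0432)
```

If the two capitals shared one code point, lowercasing could not know which lowercase letter to produce without knowing the language of the text.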