07-21-2023, 07:31 AM
If your text layer contains "Déjà" and you retrieve the text/markup, you get 'D\xc3\xa9j\xc3\xa0'. Its len() is 6, although "Déjà" has only 4 characters. What you got is a sequence of bytes: the UTF-8 encoding of "Déjà", in which the é (U+00E9) and the à (U+00E0) are each replaced by their UTF-8 encoding(*), which takes two bytes.
The reason is that in Python 2 plain strings (the str type) are just arrays of bytes. Since Gimp supports the whole Unicode set, when you obtain text from the Gimp API, Gimp returns the UTF-8 encoding of that text.
Python 2 does, however, support text that uses all Unicode characters through the unicode type, and you can convert str to unicode and vice versa using the decode() and encode() methods.
So, if we go back to the text layer and do pdb.gimp_text_layer_get_text(layer).decode('utf-8'), we get a unicode object that has a length of 4 and is u'D\xe9j\xe0'. Each non-ASCII character is now represented by its Unicode code point, which fits in a single element of the sequence.
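For anyone who wants to try this outside Gimp, here is a minimal sketch of the same round trip. The byte literal is only a stand-in for what pdb.gimp_text_layer_get_text(layer) would return for a layer containing "Déjà" (the variable names are mine), and it is written so it behaves the same in Python 2 and Python 3:

```python
# Stand-in for the value returned by pdb.gimp_text_layer_get_text(layer)
# when the layer contains "Déjà" (assumption: no Gimp needed to run this).
raw = b'D\xc3\xa9j\xc3\xa0'

# Bytes -> text: interpret the bytes as UTF-8.
text = raw.decode('utf-8')

print(len(raw))    # number of bytes
print(len(text))   # number of characters

# Text -> bytes: what you would do before handing text back to Gimp.
encoded_again = text.encode('utf-8')
```

The first print gives 6 (bytes) and the second gives 4 (characters), and encoded_again is identical to raw, so decode() and encode() are exact inverses here.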
(*) Technically, the whole string is encoded in UTF-8, but, by design, the plain ASCII characters (up to 0x7F) are encoded as themselves in UTF-8, so when you only deal with American English, not handling UTF-8 sort of works.
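A small sketch of why ignoring UTF-8 "sort of works" for ASCII and then bites you with accents. The helper name decodes_cleanly is hypothetical, and the byte literals again stand in for text returned by Gimp:

```python
def decodes_cleanly(raw):
    """Return True if the byte string is valid UTF-8 on its own."""
    try:
        raw.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

# Pure ASCII: one byte per character, so byte-level code gets away with it.
ascii_raw = b'Deja'
same_length = len(ascii_raw) == len(ascii_raw.decode('utf-8'))

# With accents, slicing bytes can cut a character in half:
accented_raw = b'D\xc3\xa9j\xc3\xa0'
first_two_bytes = accented_raw[:2]                      # 'D' plus half of 'é'
first_two_chars = accented_raw.decode('utf-8')[:2]      # 'D' and 'é'

print(same_length)
print(decodes_cleanly(first_two_bytes))
```

Here same_length is True, while the two-byte slice is no longer valid UTF-8, so the second print gives False. Decoding first and slicing the unicode object afterwards avoids that.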