06-24-2019, 03:50 PM
Something I'm considering doing here soon, which falls into the 'needed infrastructure improvement' category, is to deal with our support of Unicode. The whole issue of Unicode is a mess. Currently we use UTF-16 on Windows, since that's the native format. But it sometimes takes two UTF-16 code units to represent a character. That makes it possible to split a character in the middle if you aren't careful, and in theory you can never really just index a string, because the Nth UTF-16 code unit may be in the middle of a character.
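A minimal sketch of the problem, assuming a raw char16_t buffer (names here are made up for illustration): before you can safely cut or index a UTF-16 string at an arbitrary position, you have to check whether that position lands on the trailing half of a surrogate pair.

#include <cstddef>

// Returns true if index c4Index falls on the trailing (low surrogate) half
// of a surrogate pair, i.e. cutting or indexing there would land in the
// middle of a character. Illustrative only.
bool bInMiddleOfChar(const char16_t* pszText, const size_t c4Index)
{
    const char16_t chUnit = pszText[c4Index];

    // Low surrogates (0xDC00 to 0xDFFF) are always the second half of a pair
    return (chUnit >= 0xDC00) && (chUnit <= 0xDFFF);
}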
Some folks argue for UTF-8, but it's just so complicated. I use it for persisted text, where it's great: no endian issues, and it's compact. But as an in-memory representation it's brutal. It's variable length, so just to get the length of a string you have to traverse the whole thing and count the characters. Characters can't just be native types anymore; they'd have to be a little structure of some sort with a byte count. A lot of people act like it's all simple and ask why you would ever want to get the 3rd character from a string. But it's utterly common to do such things.
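For instance, just getting the character count out of a UTF-8 buffer means touching every byte. A minimal sketch, with illustrative names only:

#include <cstddef>

// Count the code points in a UTF-8 buffer. Continuation bytes have the
// form 10xxxxxx, so only bytes that don't match that pattern start a new
// code point. Getting the length is O(n) in the byte count.
size_t c4CountCodePoints(const unsigned char* pc1Buf, const size_t c4ByteCount)
{
    size_t c4Count = 0;
    for (size_t c4Index = 0; c4Index < c4ByteCount; c4Index++)
    {
        if ((pc1Buf[c4Index] & 0xC0) != 0x80)
            c4Count++;
    }
    return c4Count;
}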
The other option is UTF-32, which uses 4 bytes per character. That's the native wide character format on Linux. It wastes some bytes, but it allows us to handle all Unicode characters as fundamental values, index strings in a natural way, calculate the length of a string from the byte size of the buffer, do calculations on byte offsets, and so on.
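With a fixed-width encoding those operations collapse to almost nothing. A trivial sketch, again with made-up names:

#include <cstddef>

// The character count falls straight out of the byte count...
inline size_t c4CharCount(const size_t c4ByteCount)
{
    return c4ByteCount / sizeof(char32_t);
}

// ...and the Nth character is just the Nth element
inline char32_t chCharAt(const char32_t* pszText, const size_t c4Index)
{
    return pszText[c4Index];
}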
But that doesn't deal with 'graphemes'. A lot of things you see displayed aren't individual code points, but a combination of two or more that make up one displayed 'glyph'. So even with UTF-32 you might split a grapheme. That doesn't create invalid Unicode though, which is important. It just means that something that was intended to display as one glyph ends up with one piece going to one string and another piece going to the other. I'm not sure that's worth worrying about relative to the other advantages of still having actual characters and indexable strings, particularly given that there's no practical way to determine grapheme boundaries without dragging in the full Unicode segmentation rules and their property tables. You can only depend on them not having any of the usual separation characters in them (spaces, new lines, etc...) In terms of text processing in general, I think that treating each UTF-32 value as a character is reasonable.
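As a concrete example (purely illustrative), an 'é' can be stored as a base letter plus a combining acute accent: two valid UTF-32 code points that display as a single glyph. Splitting between them leaves two perfectly valid strings, just an odd-looking result:

#include <string>

void GraphemeSplitDemo()
{
    // One glyph, two code points: 'e' followed by U+0301 (combining acute)
    const std::u32string strAccented = U"e\u0301";

    // Both halves are still valid Unicode; the second is just a bare accent
    const std::u32string strLeft = strAccented.substr(0, 1);
    const std::u32string strRight = strAccented.substr(1);
}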
A lot of people say you have to use UTF-8 these days, and act like indexing a string, finding the length of a string, or getting some character near the middle are somehow unusual operations. But clearly those are common, and a lot of things you would commonly do in text manipulation become a lot harder and much higher overhead.
Of course UTF-16 is still the native Windows format, so Windows would necessarily become the worst case scenario, in that all text would have to get converted to native on each use. Or, the string class could maintain a separate native format buffer, and any time the internal format is changed a flag would be set indicating that the native format has to be regenerated before its next use. That would mean, though, that a string would have to hold both a UTF-32 and a UTF-16 form of the text, which is a lot given that most of the characters are really only one byte's worth of content. But it's the only thing that would ensure good performance. Even then it could hurt, because if you change one character in the internal format, the whole native format has to be regenerated. In some cases that would happen every time, such as updating display text every time the user types another character.
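A rough sketch of that dual-format scheme, assuming UTF-32 internally with a cached UTF-16 copy for handing to Windows APIs (the class and method names are made up for illustration):

#include <string>

class TDualString
{
public:
    // Any change to the internal format just marks the native copy stale
    void Append(const char32_t chToAdd)
    {
        m_strInternal += chToAdd;
        m_bNativeStale = true;
    }

    // The native UTF-16 form is only regenerated when actually needed
    const char16_t* pszNative()
    {
        if (m_bNativeStale)
        {
            RegenNative();
            m_bNativeStale = false;
        }
        return m_strNative.c_str();
    }

private:
    void RegenNative()
    {
        m_strNative.clear();
        for (const char32_t chCur : m_strInternal)
        {
            if (chCur <= 0xFFFF)
            {
                // Fits in a single UTF-16 code unit
                m_strNative += static_cast<char16_t>(chCur);
            }
            else
            {
                // Needs a surrogate pair
                const char32_t chAdj = chCur - 0x10000;
                m_strNative += static_cast<char16_t>(0xD800 + (chAdj >> 10));
                m_strNative += static_cast<char16_t>(0xDC00 + (chAdj & 0x3FF));
            }
        }
    }

    std::u32string  m_strInternal;
    std::u16string  m_strNative;
    bool            m_bNativeStale = true;
};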
If we did UTF-8 we'd have to do something similar, since the conversion is even higher overhead in that case, being variable length. We'd pay the price of UTF-8's complexity and still need to keep around the UTF-16 (Windows) or UTF-32 (Linux) form for passing to system APIs, or convert from UTF-8 to native every time text is passed to the system. So it either wouldn't save much space and would introduce a lot of complexity, or it would save a lot of space but introduce a lot of complexity and overhead.
Of course not using the native wide character format means that, even down in the operating system wrapper layer, we can't use any of the native text processing helpers. We'd have to move that functionality up into our own code, which could be difficult. Otherwise we'd have to convert to native, use some native processing, then convert back, which would be stupidly heavy for most operations.
Anyhoo, I'm sort of typing out loud. It's all quite heavy, but something needs to be done. The path of least resistance is to get something that works for Linux and Windows and that drastically reduces the chances on Windows of splitting characters when dealing with non-western languages. That would argue for UTF-32. But maybe it's time to just suck it up, do UTF-8, and be done with it and accept the pain. I dunno.
Dean Roddey
Explorans limites defectum