We've had a report that Korean display got a lot worse in the snapshots between r5002 and r5003 <000001c513c9$e63ae1e0$aa000059@ktd>. Quoth Simon:
Looking at the code, I think I can see why this is happening. This is to do with RDB's idea that when the user selects `use font encoding' and a font with a DBCS encoding, the terminal code should simply store the individual bytes in individual character cells and rely on do_text() being passed a string of these so that TextOut() can reconstitute pairs of DBCS bytes into double-width characters.
As far as I can tell, terminal.c does not mark the first byte of a DBCS character stored in this way. Therefore, the mechanism is fundamentally dependent on a do_text() run happening to begin at the correct point mod 2! Hence the comment in the mail referenced above, which said that there was already some breakage when the cursor moved over a double-byte character - the half-character under the cursor cannot be properly redrawn. Owing to `font-overflow', though, when you move the cursor over a double-byte character we now redraw a lot of text to the right of that as well, and if the cursor is on the first half of the character then this is bound to be incorrect mod 2; so the problem shows up a lot more readily. I'd bet that the same breakage could have been seen in previous versions if the window was covered and re-exposed when the cursor was in a problem position.
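The mod-2 dependence can be illustrated with a small sketch. This is not PuTTY's code: `pair_run` is a hypothetical stand-in for a renderer that reconstitutes byte pairs from a run of single-byte cells, and the lead-byte test (any byte >= 0x80) is a simplifying assumption that happens to suit EUC-KR text.

```python
# Illustrative only: each cell holds one byte and nothing marks lead bytes.
def pair_run(cells, start):
    """Pair up bytes the way a DBCS-aware renderer might, from index `start`.

    Assumption: any byte >= 0x80 is treated as a lead byte and swallows the
    byte after it. Real encodings have narrower lead-byte ranges.
    """
    out = []
    i = start
    while i < len(cells):
        if cells[i] >= 0x80 and i + 1 < len(cells):
            out.append(bytes(cells[i:i + 2]))  # reconstituted double-byte pair
            i += 2
        else:
            out.append(bytes(cells[i:i + 1]))  # single byte (or orphaned half)
            i += 1
    return out


text = "한글"
cells = list(text.encode("euc_kr"))  # four cells, one byte in each

aligned = pair_run(cells, 0)     # run starts on a lead byte: pairs are right
misaligned = pair_run(cells, 1)  # run starts on a trail byte: pairs are shifted
```

Here `aligned` decodes back to the original two characters, while `misaligned` pairs the trail byte of the first character with the lead byte of the second and leaves an orphaned byte at the end, which is the kind of garbage a redraw starting mid-character would produce.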
A real fix for this would involve implementing proper DBCS support, by detecting DBCS lead bytes in the terminal.c input data stream and storing both bytes in the same character cell using the existing UCSWIDE mechanism. I have occasionally wondered about doing this: I envisage that we would co-opt the top half of the unsigned long space (never used by any flavour of Unicode/UCS ever) to provide more than enough fake character encodings for the purpose.
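A minimal sketch of the "top half of the unsigned long space" idea, assuming a 32-bit cell value. The names `DBCS_FLAG`, `is_lead_byte`, `pack_cells` and `unpack_cell` are hypothetical, the lead-byte range is an EUC-style assumption, and the UCSWIDE placeholder cell that would follow a wide character is not modelled.

```python
DBCS_FLAG = 0x80000000  # hypothetical marker: top bit of a 32-bit cell value


def is_lead_byte(b):
    # Assumption: EUC-style lead bytes; other DBCS encodings use other ranges.
    return 0xA1 <= b <= 0xFE


def pack_cells(data):
    """Turn an input byte stream into cells, one cell per character."""
    cells = []
    i = 0
    while i < len(data):
        b = data[i]
        if is_lead_byte(b) and i + 1 < len(data):
            # Both bytes go into one cell, flagged as a fake DBCS code point.
            cells.append(DBCS_FLAG | (b << 8) | data[i + 1])
            i += 2
        else:
            cells.append(b)
            i += 1
    return cells


def unpack_cell(cell):
    """Recover the original byte(s) from a cell for drawing."""
    if cell & DBCS_FLAG:
        return bytes([(cell >> 8) & 0xFF, cell & 0xFF])
    return bytes([cell])


cells = pack_cells("A한".encode("euc_kr"))
```

Because every double-byte character now lives in a single flagged cell, a redraw can begin at any cell without caring about byte parity.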
Of course, if we were going to support DBCSes in terminal.c it would also be good to be able to support them properly, by translating them to Unicode on input.
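Translating on input means coping with a character whose two bytes arrive in separate reads. The sketch below uses Python's standard incremental decoders to show the idea; `decode_stream` is a hypothetical helper, EUC-KR is just an example encoding, and none of this reflects how terminal.c would actually do it.

```python
import codecs


def decode_stream(chunks, encoding="euc_kr"):
    """Decode DBCS input to Unicode as it arrives, chunk by chunk."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    out = []
    for chunk in chunks:
        # A dangling lead byte is held back until its trail byte arrives.
        out.append(decoder.decode(chunk))
    # Flush at end of input; an unfinished character becomes U+FFFD.
    out.append(decoder.decode(b"", final=True))
    return "".join(out)


raw = "한글".encode("euc_kr")
# Split the stream in the middle of both characters.
decoded = decode_stream([raw[:1], raw[1:3], raw[3:]])
```

Once the cells hold Unicode, the display side no longer depends on where a run of bytes happens to start.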
Summary: I think this has always been broken, and now it's merely more obviously broken. I regret the effect on CJK users who had found the previous behaviour worked just about well enough, but I don't think a hurried fix is in the general interest.
UTF-8 mode should work reasonably well. A workaround is to use UTF-8 if possible (perhaps via something such as luit or screen).