Summary: | en_US.UTF-8 contains combining_* | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | xorg | Reporter: | Simos Xenitellis <simos.bugzilla> | ||||||||||||||
Component: | Lib/Xlib (data) | Assignee: | Daniel Stone <daniel> | ||||||||||||||
Status: | RESOLVED FIXED | QA Contact: | Xorg Project Team <xorg-team> | ||||||||||||||
Severity: | normal | ||||||||||||||||
Priority: | high | CC: | jeremyhu, leoboiko, samuel.thibault | ||||||||||||||
Version: | unspecified | Keywords: | i18n, patch | ||||||||||||||
Hardware: | x86 (IA32) | ||||||||||||||||
OS: | Linux (All) | ||||||||||||||||
Whiteboard: | 2011BRB_Reviewed | ||||||||||||||||
Attachments: |
|
Description
Simos Xenitellis
2005-11-21 01:06:20 UTC
reassigning per irc

It does not belong to xkeyboard-config

Reassigning to xorg-team@lists.x.org as this is not an XKB bug.

Created attachment 5352
Patch to rename <dead_space> to <space>

The keysym name is <space> and is used extensively in the file. However, there are two occurrences of <dead_space> that are not defined in keysymdef.h, so they are probably typos. The patch renames them to <space>, and all should be ok (tm).

i've committed the patch from c#4, but what did we decide to do with combining_* ?

(In reply to comment #5)
> i've committed the patch from c#4, but what did we decide to do with combining_* ?

I am not aware of a decision about the combining_xxxx symbols. I do not know what the value of "combining_xxxx" is. Are they aliases (friendly names) for combining diacritical marks? For example, is "combining_acute" the same as "0x0301"? I did not find a reference in the source code ("grep -ir combining *") when I last searched.

to be honest, I'm not sure ...

Sorry about the phenomenal bug spam, guys. Adding xorg-team@ to the QA contact so bugs don't get lost in future.

Excerpt from xfree86 keysymdef.h, in the XK_VIETNAMESE section:

    #define XK_combining_tilde     0x1e9f  /* U+0303 */
    #define XK_combining_grave     0x1ef2  /* U+0300 */
    #define XK_combining_acute     0x1ef3  /* U+0301 */
    #define XK_combining_hook      0x1efe  /* U+0309 */
    #define XK_combining_belowdot  0x1eff  /* U+0323 */

So these are just legacy Vietnamese-purpose keysyms.

Our current compose files suggest that these should be typed before the base letter. The name of the keysym and the comments above, however, suggest that they should be typed after it (just like Unicode combining characters are put after the base character). There's an old file at http://www.cl.cam.ac.uk/~mgk25/ucs/keysyms.txt which says the latter is correct (and hence all uses of combining_* in our compose files are just bogus), and that these keysyms should actually be dropped.
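To make the relationship between the legacy values and Unicode concrete, here is a minimal Python sketch. The table is copied from the keysymdef.h excerpt above; the helper names (`LEGACY_COMBINING`, `unicode_keysym`, `describe`) are made up for illustration, and the 0x01000000 offset is the X11 convention for Unicode keysyms, not something defined by this bug's patches.

```python
# Legacy Vietnamese keysyms from the keysymdef.h excerpt:
# name -> (legacy keysym value, Unicode code point from the comment).
LEGACY_COMBINING = {
    "combining_tilde":    (0x1E9F, 0x0303),
    "combining_grave":    (0x1EF2, 0x0300),
    "combining_acute":    (0x1EF3, 0x0301),
    "combining_hook":     (0x1EFE, 0x0309),
    "combining_belowdot": (0x1EFF, 0x0323),
}

# X11 represents a Unicode code point as a keysym by adding this offset.
UNICODE_KEYSYM_BASE = 0x01000000


def unicode_keysym(code_point: int) -> int:
    """Return the Unicode-style keysym value for a code point."""
    return UNICODE_KEYSYM_BASE + code_point


def describe(name: str) -> str:
    """Summarise one legacy keysym and its Unicode-style replacement."""
    legacy, cp = LEGACY_COMBINING[name]
    return (f"{name}: legacy 0x{legacy:04x} -> U+{cp:04X} "
            f"-> Unicode keysym 0x{unicode_keysym(cp):x}")


if __name__ == "__main__":
    for keysym_name in LEGACY_COMBINING:
        print(describe(keysym_name))
```

Under this convention combining_acute would map to 0x1000301, which is the kind of value a Unicode-based redefinition of these keysyms would use.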
Some research on the Vietnamese language shows that accents are usually typed after the base letter, and this is the usual way with VIQR, so I also believe the latter is correct. I think we can put these defines back by using 0x10003xy, and we need to fix the compose files so that they are typed after the base letter. We may also need to fix our current vn keyboard layout, which is currently using dead_foo. I've asked a Vietnamese friend for confirmation, but it looks to me like this should be fixed back to using combining_foo, just like in VIQR, and just like Windows does actually...

My Vietnamese friend confirms: the standard way in Vietnam is to type the base letter and then the accent. So we need to fix the vn keymap to use combining_*.

The file to fix is http://webcvs.freedesktop.org/xkeyboard-config/xkeyboard-config/symbols/vn?view=markup
You need to file a bug report similar to bug 7807. Contact the last person that did work on the Vietnamese layout (see bug 7807).

Mmm, however that will require changes in xproto, xlib and eventually xkeyboard-config (that one needs to be last).

Created attachment 14846
Puts back the combining_ keysyms

This puts back the combining_ keysyms, using the unicode value.

Created attachment 14848
Drops the bogus combining_* statements

Mmm, I was about to submit a patch that fixes a huge lot of

     <Multi_key> <grave> <A> : "À" Agrave # LATIN CAPITAL LETTER A WITH GRAVE
    -<combining_grave> <A> : "À" Agrave # LATIN CAPITAL LETTER A WITH GRAVE
    +<A> <combining_grave> : "À" Agrave # LATIN CAPITAL LETTER A WITH GRAVE
     <dead_acute> <A> : "Á" Aacute # LATIN CAPITAL LETTER A WITH ACUTE

But actually trying it gives an idea of the nightmare: since combining_* keysyms are supposed to be typed after the base letter, Xlib would have to wait for one before sending e.g. <A> to the application. And since we have those for a pretty huge lot of keysyms, almost no key shows up immediately.
As a result, here is instead a patch that just drops them, since they are bogus anyway. Instead, I'll have to implement an input method.

This patch and the previous one can be applied immediately to clean up all the currently bogus code :) The vn keyboard layout patch will have to wait for proper combining_* support in libX11.

Created attachment 14849
Vietnamese compositions

Mmm, that said, the only compositions that are really needed are these.

I'm just wondering: when setting

    <foo> <bar> : "baz"

in a Compose file, is there a strong reason to then always prevent <foo> from being typed alone? If libX11 could, when <foo> is typed and then followed by something other than <bar>, produce foo, then the attached compose file would be completely fine for a Vietnamese locale: when typing <A>, libX11 would have to wait for the next keypress, in case it is e.g. <combining_acute>, and produce Á; if it's something else not listed in the Compose table, it would produce A and the following key. This won't be frowned upon by Vietnamese people since that's how it's usually typed.

Of course, a nice xinput method as implemented in scim helps, but I don't think we should try to reimplement one in libX11, considering how well the solution above already behaves.

Created attachment 25644
Drops the bogus combining_* statements

Could patches 14846 and 25644 be applied please? They do not implement anything, but they clean things up and make it clear what combining_* are.

I pushed commit 79f47e6dff2f0a0b673bbfecc47528edca814baa which is based on 25644 (which no longer applied cleanly, and which was made before fi was split off from en).

Instead of pushing 14846, I replaced the target combining_ keysyms (which do not exist in x11proto’s keysymdef.h) in the Compose tables with the appropriate Uxxx symbols. That seems cleaner.
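The fallback behaviour proposed earlier for attachment 14849 (hold <A> back until the next key, then emit either the precomposed character or the plain keys) can be sketched as a tiny state machine. This is only a toy model in Python under stated assumptions: the two-entry `COMPOSE` table, the `PLAIN` key-to-text map and the `compose` function are hypothetical, and this is not how libX11's Compose handling works today.

```python
# Toy postfix compose table: (base key, combining key) -> precomposed text.
COMPOSE = {
    ("A", "combining_acute"): "\u00c1",  # LATIN CAPITAL LETTER A WITH ACUTE
    ("A", "combining_grave"): "\u00c0",  # LATIN CAPITAL LETTER A WITH GRAVE
}

# What each key produces when it is not part of a completed sequence.
PLAIN = {
    "A": "A",
    "B": "B",
    "combining_acute": "\u0301",
    "combining_grave": "\u0300",
}


def is_prefix(candidate):
    """True if candidate is a proper prefix of some compose sequence."""
    n = len(candidate)
    return any(len(seq) > n and seq[:n] == candidate for seq in COMPOSE)


def compose(keys):
    """Feed key names one by one; unmatched pending keys fall through as-is."""
    out, pending = [], ()
    for key in keys:
        candidate = pending + (key,)
        if candidate not in COMPOSE and not is_prefix(candidate):
            # The held-back keys lead nowhere: emit them unchanged,
            # then reconsider the current key on its own.
            out.extend(PLAIN[k] for k in pending)
            candidate = (key,)
        if candidate in COMPOSE:
            out.append(COMPOSE[candidate])
            pending = ()
        elif is_prefix(candidate):
            pending = candidate          # wait for the next key
        else:
            out.append(PLAIN[key])
            pending = ()
    out.extend(PLAIN[k] for k in pending)  # flush whatever is still held
    return "".join(out)


if __name__ == "__main__":
    print(compose(["A", "combining_acute"]))
    print(compose(["A", "B"]))
```

The cost discussed in the thread is visible in the model: every key that starts a sequence is held in `pending` and produces no output until the following key arrives.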
Written on Fri 08 May 2009 17:26:34 -0700:
> I pushed commit 79f47e6dff2f0a0b673bbfecc47528edca814baa which is
> based on 25644 (which no longed applied cleanly, and which was made
> before fi was split off from en).

Oops, sorry, I thought I had pulled before regenerating the patch but apparently not.

> Instead of pushing 14846, I replaced the target combining_ keysyms
> (which do not exist in x11proto’s keysymdef.h)

?! The purpose of 14846 was precisely to restore the XK_combining_ keysyms, since they are part of the X11 standard and some applications may even depend on them. And while restoring it, giving them the unicode values instead of the legacy values, as it becomes clean for e.g. gtk to implement them: they're just unicode combining characters.

> in the Compose tables with the appropriate Uxxx symbols. That seems
> cleaner.

Urgl, what are these rules:

    <U0331> <B> : "Ḇ" U1E06 # LATIN CAPITAL LETTER B WITH LINE BELOW

?! They're plain wrong! U3xy are combining characters, i.e. they are supposed to be typed _after_ the letter they modify. As I explained earlier in this bug report, ideally we should have a series of rules like

    <B> <U0331> : "Ḇ" U1E06 # LATIN CAPITAL LETTER B WITH LINE BELOW

but that means that when typing B, we do not get any output. Just not including them is not so bad, contracted unicode characters will not be used, that's all.

If people want dead keys, they need to define XK_dead_* keysyms and use them, not change the meaning of other keysyms!!

> Oops, sorry, I thought I had pulled before regenerating the patch
> but apparently not.

No biggie.

>> Instead of pushing 14846, I replaced the target combining_ keysyms
>> (which do not exist in x11proto’s keysymdef.h)
> The purpose of 14846 was precisely to restore the XK_combining_
> keysyms, since they are part of the X11 standard and some applications
> may even depend on them.

I was not aware that they are part of the standard.
I see that they were added to (xorg) cvs in 04/Apr and removed in 04/Sept. In Xfree86, they were added in rev 1.10, which has this as part of its commit log:

,----
| 820. Pablo Saratxaga's i18n updates for XFree86 that are used in Mandrake 7.2.
|      Includes various new and fixed xkb files, locale name additions and
|      updates, and new support for varios charset encodings (#4195,
|      Pablo Saratxaga).
`----

My understanding was that the Uxxx codes are preferred. But I could be wrong here.

> Urgl, what are these rules:
> <U0331> <B> : "Ḇ" U1E06 # LATIN CAPITAL LETTER B WITH LINE BELOW

Oh.. I forgot about those yesterday. They have been there for a while, and there is a bug about fixing them, too. The goal is to remove any which are superfluous and modify those which are the only non-Multi_key option to use a dead_ key symbol instead.

U0331 is in the former group, since dead_belowmacron is available and every line in Compose.pre referencing U0331 already has a matching dead_belowmacron line.

U030F is in the latter group. Ȁ (U+0200) is an example of a character which does not have a dead_ option; only <U030F> <A> already exists. This case doesn’t even have a Multi_key sequence.
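The two groups described above (superfluous combining-prefix rules versus rules that are the only non-Multi_key way to reach a character) could be found mechanically. The following Python sketch is an assumption-laden illustration: the three `SAMPLE` lines are made up in the style of the rules quoted in this report rather than taken from the real Compose.pre, and `classify` is a hypothetical helper, not an existing tool.

```python
import re
from collections import defaultdict

# Illustrative rules only; not an excerpt of the real Compose.pre.
SAMPLE = '''\
<dead_belowmacron> <B> : "Ḇ" U1E06 # LATIN CAPITAL LETTER B WITH LINE BELOW
<U0331> <B> : "Ḇ" U1E06 # LATIN CAPITAL LETTER B WITH LINE BELOW
<U030F> <A> : "Ȁ" U0200 # LATIN CAPITAL LETTER A WITH DOUBLE GRAVE
'''

# A rule is a run of <keysym> tokens, a colon, then the quoted result.
RULE = re.compile(r'^\s*((?:<\w+>\s*)+):\s*"([^"]*)"')
# Keysym names for the Combining Diacritical Marks block, U+0300..U+036F.
COMBINING = re.compile(r'^U03[0-6][0-9A-F]$')


def classify(text):
    """Split combining-prefixed rules into (superfluous, needs_dead_key)."""
    by_result = defaultdict(list)
    for line in text.splitlines():
        match = RULE.match(line)
        if match:
            keys = re.findall(r'<(\w+)>', match.group(1))
            by_result[match.group(2)].append(keys)

    superfluous, needs_dead_key = [], []
    for result, sequences in by_result.items():
        has_dead = any(seq[0].startswith("dead_") for seq in sequences)
        for seq in sequences:
            if COMBINING.match(seq[0]):
                target = superfluous if has_dead else needs_dead_key
                target.append((seq, result))
    return superfluous, needs_dead_key


if __name__ == "__main__":
    removable, to_convert = classify(SAMPLE)
    print("superfluous:", removable)
    print("needs a dead_ key:", to_convert)
```

On the sample, the U0331 rule lands in the first list because a dead_belowmacron rule yields the same character, while the U030F rule lands in the second because nothing else produces Ȁ.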
The full list of terminal characters in en_US.UTF-8 which have a UCS combining character symbol used as though it were a dead key is:

,----
| U0200 U0201 U0202 U0203 U0204 U0205 U0206 U0207 U0208 U0209 U020A
| U020B U020C U020D U020E U020F U0210 U0211 U0212 U0213 U0214 U0215
| U0216 U0217 U0218 U0219 U021A U021B U0326 U0476 U0477 U1E00 U1E01
| U1E06 U1E07 U1E0E U1E0F U1E12 U1E13 U1E18 U1E19 U1E1A U1E1B U1E2A
| U1E2B U1E2C U1E2D U1E34 U1E35 U1E3A U1E3B U1E3C U1E3D U1E48 U1E49
| U1E4A U1E4B U1E5E U1E5F U1E6E U1E6F U1E70 U1E71 U1E72 U1E73 U1E74
| U1E75 U1E76 U1E77 U1E94 U1E95 U1E96 U1F00 U1F01 U1F02 U1F03 U1F04
| U1F05 U1F06 U1F07 U1F08 U1F09 U1F0A U1F0B U1F0C U1F0D U1F0E U1F0F
| U1F10 U1F11 U1F12 U1F13 U1F14 U1F15 U1F18 U1F19 U1F1A U1F1B U1F1C
| U1F1D U1F20 U1F21 U1F22 U1F23 U1F24 U1F25 U1F26 U1F27 U1F28 U1F29
| U1F2A U1F2B U1F2C U1F2D U1F2E U1F2F U1F30 U1F31 U1F32 U1F33 U1F34
| U1F35 U1F36 U1F37 U1F38 U1F39 U1F3A U1F3B U1F3C U1F3D U1F3E U1F3F
| U1F40 U1F41 U1F42 U1F43 U1F44 U1F45 U1F48 U1F49 U1F4A U1F4B U1F4C
| U1F4D U1F50 U1F51 U1F52 U1F53 U1F54 U1F55 U1F56 U1F57 U1F59 U1F5B
| U1F5D U1F5F U1F60 U1F61 U1F62 U1F63 U1F64 U1F65 U1F66 U1F67 U1F68
| U1F69 U1F6A U1F6B U1F6C U1F6D U1F6E U1F6F U1F80 U1F81 U1F82 U1F83
| U1F84 U1F85 U1F86 U1F87 U1F88 U1F89 U1F8A U1F8B U1F8C U1F8D U1F8E
| U1F8F U1F90 U1F91 U1F92 U1F93 U1F94 U1F95 U1F96 U1F97 U1F98 U1F99
| U1F9A U1F9B U1F9C U1F9D U1F9E U1F9F U1FA0 U1FA1 U1FA2 U1FA3 U1FA4
| U1FA5 U1FA6 U1FA7 U1FA8 U1FA9 U1FAA U1FAB U1FAC U1FAD U1FAE U1FAF
| U1FB6 U1FB7 U1FC1 U1FC6 U1FC7 U1FCF U1FD6 U1FD7 U1FDF U1FE4 U1FE5
| U1FE6 U1FE7 U1FEC U1FF6 U1FF7
`----

And the full list of UCS combining character symbols used as though they were dead keys is:

,----
| U030F [ ̏] COMBINING DOUBLE GRAVE ACCENT
| U0311 [ ̑] COMBINING INVERTED BREVE
| U0313 [ ̓] COMBINING COMMA ABOVE
| U0314 [ ̔] COMBINING REVERSED COMMA ABOVE
| U0324 [ ̤] COMBINING DIAERESIS BELOW
| U0325 [ ̥] COMBINING RING BELOW
| U0326 [ ̦] COMBINING COMMA BELOW
| U032D [ ̭] COMBINING CIRCUMFLEX ACCENT BELOW
| U032E [ ̮] COMBINING BREVE BELOW
| U0330 [ ̰] COMBINING TILDE BELOW
| U0331 [ ̱] COMBINING MACRON BELOW
| U0342 [ ͂] COMBINING GREEK PERISPOMENI
`----

> My understanding was that the Uxxx codes are preferred. But I could
> be wrong here.

Mmm, using the unicode value is better yes, but aren't we still supposed to define XK_* values? That makes it way more readable in xkb files.

> > Urgl, what are these rules:
>
> > <U0331> <B> : "Ḇ" U1E06 # LATIN CAPITAL LETTER B WITH LINE BELOW
>
> Oh.. I forgot about those yesterday. They have been there for a while,
> and there is a bug about fixing them, too.

Ok.

Sorry to parachute in on this, but these Compose entries

    <combining_*> <base>: <precomposed>

just popped up in my distro and I got frightened. I don't have the values in keysymdef.h, but if you're really discussing whether to use Unicode combining characters as Compose prefixes, I'd like to humbly ask to never, ever do that.

Rationale: Unicode is perfectly capable of encoding characters with diacritics without precomposing. This is what the Unicode combining characters are supposed to be for. And, in Unicode, the combining characters (accents, diacritics &c.) are always, always _after_ the base character[1] (in logical order, i.e., the encoded characters in the file, without regard to bidi issues). So a U+00E1 LATIN SMALL LETTER A WITH ACUTE is completely equivalent to the sequence <U+0061, U+0301> (<LATIN SMALL LETTER A, COMBINING ACUTE ACCENT>).

If you map some key to U+0301, you can type the decomposed sequence simply by inputting a regular U+0061, then the accent. No Compose is involved; as far as X and the filesystem are concerned, they are two separate characters. It is the responsibility of the font layout system to render them as a single character, and of the software to treat them as a single character for purposes such as searching, counting, case mapping &c.

(True, in Linux, most fonts won't display the sequence properly.
The popular Bitstream Vera, Liberation, and msttcorefonts all fail; one needs to use DejaVu, GNU unifont, or lesser-known ones such as Gentium. But this issue is orthogonal to the discussion at hand.)

It is also true that the Unicode postfix order runs counter to the convention of almost all countries' typing schools. But it does have its own advantages:

- It is similar to handwriting; it could be argued it would be more logical for computer illiterates.
- It avoids hidden states; each key you press changes something on screen.
- It is better for long sequences. With Unicode you can type e.g. latin small letter a with acute, macron, cedilla, and dot below (by inputting each char in this order). Imagine having Compose sequences for all the possible combinations.

I'm not suggesting to change the typing habits of the whole world, far from it; my point is, the Unicode method should be _possible_. Right now one can just map their favorite combinings with xmodmap, or perhaps use something like my us(intl-unicode) layout (shameless plug![2]).

But suppose we mix Unicode combining and X composing, like this version of the Compose file seems to suggest. What would happen when you type U+0061, U+0301 (a and acute)? Instead of putting the acute on the a like the Unicode user wants, the system would enter the hidden Compose state, and then put the acute on the _next_ letter (if it's listed in the Compose file at all!).

I agree most people want precomposed characters, and most people want to type in prefix order. For that, the current schemes with deadkeys and Multi_key + non-combining chars work perfectly well. Can we keep the Unicode combining mechanism separate from this mess?

Notes:

[1] Well, almost always. I heard there are a couple of exceptions in Arabic or something, but these are deprecated compatibility characters to support round-trip conversion to legacy encodings only.
[2] http://github.com/leoboiko/us-intl-unicode

> but if you're really discussing whether to use Unicode combining
> characters as Compose prefixes, I'd like to humbly ask to never,
> ever do that.
The only combining keysyms left in the Compose tables in libX11’s git
repo are targets (such as the use of « <dead_abovedot> <nobreakspace> »
to enter U+0307 COMBINING DOT ABOVE) and U+0338 COMBINING LONG SOLIDUS
OVERLAY used postfix in Multi_key-initiated sequences to enter the NFC
version of some negated math symbols. An example of the latter is
« <Multi_key> <less> <U0338> » to enter ≮ U+226E NOT LESS-THAN.
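The canonical equivalences relied on in this discussion can be checked with Python's standard `unicodedata` module. This is just a quick verification sketch, not part of any patch in this report; the helper names `nfc` and `nfd` are local conveniences.

```python
import unicodedata


def nfc(text: str) -> str:
    """Compose text into Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def nfd(text: str) -> str:
    """Decompose text into Unicode Normalization Form D."""
    return unicodedata.normalize("NFD", text)


# Base letter followed by a combining accent (the postfix order Unicode uses).
a_acute_decomposed = "a\u0301"      # a + COMBINING ACUTE ACCENT
a_acute_precomposed = "\u00e1"      # LATIN SMALL LETTER A WITH ACUTE

# '<' followed by COMBINING LONG SOLIDUS OVERLAY, as in the Multi_key sequence.
not_less_decomposed = "<\u0338"
not_less_precomposed = "\u226e"     # NOT LESS-THAN

if __name__ == "__main__":
    print(nfc(a_acute_decomposed) == a_acute_precomposed)
    print(nfc(not_less_decomposed) == not_less_precomposed)
    print(nfd(not_less_precomposed) == not_less_decomposed)
```

Both pairs normalise to each other, which is why a Compose target that emits the NFC form and a layout that emits base letter plus combining mark end up as canonically equivalent text.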

> The only combining keysyms left in the Compose tables in libX11’s git repo are targets and Multi_key-initiated sequences

Ok, nothing against those. It makes a weird kind of sense to use the Compose key to make precomposed characters, even if employing combining. Maybe it would even be good to have sequences with Multi_key for every Unicode canonical decomposition? (But the tables are already so large...)

> U+0338 COMBINING LONG SOLIDUS OVERLAY to enter the NFC version of some negated math symbols

I wonder... U+20D2 COMBINING LONG VERTICAL LINE OVERLAY seems better as a general negation sign; I mean, it's explicitly annotated as "negation" (unlike the solidus), and it's suggested for this use in the book (page 257 of Unicode 5.0), where they say it will change slant or length as required by specific symbols. OTOH the solidus is the canonical decomposition of U+226E, and canonical decompositions are forever. Perhaps map both?

bugzilla-daemon@freedesktop.org wrote on Sat 30 May 2009 08:06:58 -0700:
> if you're really discussing whether to use Unicode combining
> characters as Compose prefixes, I'd like to humbly ask to never, ever
> do that.

That's precisely what I wanted to get fixed :)

Created attachment 27397
Fix vietnamese accents typing

Hello,

The attached patch fixes the way accents are typed on a vietnamese keyboard: they are normally typed _after_ the vowel, i.e. they are combining accents. For the case when combining accents are not supported (e.g. zsh), I've kept the dead accents in the third level.

Samuel

(In reply to comment #26)
> The attached patch fixes the way accents are typed on a vietnamese
> keyboard: they are normally typed _after_ the vowel, i.e. they are
> combining accents

Hey, the approach you have in the vn layout is quite similar to the one I took on us(intl-unicode). Did you know you can use Unnnn syntax in the symbol files (i.e. U0301 instead of 0x1000301)?
Out of curiosity, could you tell me whether Unicode combining is widespread for Vietnamese, for example on Windows and OSX? (I'd expect yes, given the large number of diacritics the language employs). Are us-international keyboard layouts popular in Vietnam, the way they are here in Brazil? Would the Vietnamese perhaps be interested in my us(intl-unicode)? :)

> --- Comment #27 from Leonardo Boiko <leoboiko@gmail.com> 2009-07-05 09:50:27 PST ---
> (In reply to comment #26)
> > The attached patch fixes the way accents are typed on a vietnamese
> > keyboard: they are normally typed _after_ the vowel, i.e. they are
> > combining accents
>
> Hey, the approach you have in the vn layout is quite similar to the one I took
> on us(intl-unicode). Did you know you can use Unnnn syntax in the symbol files
> (i.e. U0301 instead of 0x1000301)?

Ah, I was wondering and a quick grep told me the latter. It doesn't matter so much, though; I'd rather have used a nice keysym name but apparently my patch to include them wasn't applied.

> Out of curiosity, could you tell me whether Unicode combining is widespread for
> Vietnamese, for example on Windows and OSX?

I don't use Vietnamese on these OSes, but I'd guess they don't use combining characters but rather precombined characters.

> Are us-international keyboard layouts popular in Vietnam, the way they
> are here in Brazil?

I have no idea.

> Would the Vietnamese perhaps be interested in my us(intl-unicode)? :)

It'd need đ, Ð, ă, â, ê, ô, ư, ơ, ₫ and the combining characters mentioned above.

There is a lot of noise in this bug and a bunch of attachments. What is the status of this work? Can patches that are ready for committing be sent to xorg-devel for review, so we can close this out?

Patch from comment #26 is already applied (http://bugs.freedesktop.org/show_bug.cgi?id=22847)

Patch from comment #13 (to put back some combining_ keysyms) wasn't applied.
The reason invoked is: “Instead of pushing 14846, I replaced the target combining_ keysyms (which do not exist in x11proto’s keysymdef.h) in the Compose tables with the appropriate Uxxx symbols. That seems cleaner.” Which I don't agree with: it's nicer to have names in symbols and compose files rather than unicode numbers...

Ok, marking fixed. Thanks. |