Bug 63072 - allow Unicode non-characters as per Corrigendum 9
Summary: allow Unicode non-characters as per Corrigendum 9
Status: RESOLVED FIXED
Alias: None
Product: dbus
Classification: Unclassified
Component: core
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: Simon McVittie
QA Contact: Havoc Pennington
Keywords: patch
 
Reported: 2013-04-03 10:33 UTC by Simon McVittie
Modified: 2013-04-22 15:28 UTC

Attachments
[1.6, master] Accept non-characters when validating Unicode (1.91 KB, patch)
2013-04-22 14:32 UTC, Simon McVittie
[master] Specification: explicitly allow the Unicode noncharacters (1.48 KB, patch)
2013-04-22 14:33 UTC, Simon McVittie
[1.6, master] [v2] Accept non-characters when validating Unicode (2.32 KB, patch)
2013-04-22 14:37 UTC, Simon McVittie

Description Simon McVittie 2013-04-03 10:33:00 UTC
libdbus and the D-Bus Specification currently disallow Unicode non-characters (U+FDD0..U+FDEF, U+xFFFE, U+xFFFF) in UTF-8 strings. This is consistent with pre-2013 versions of GLib.
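
For readers unfamiliar with the set in question, the ranges above can be sketched in C roughly as follows. This is an illustration only, not code from libdbus or GLib; the names is_noncharacter and count_noncharacters are made up for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a libdbus or GLib function).
 * Returns true if cp is a Unicode noncharacter: U+FDD0..U+FDEF, or the
 * last two code points of any plane (U+nFFFE and U+nFFFF, n = 0..0x10). */
static bool is_noncharacter(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return false;                /* outside the Unicode code space */
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return true;                 /* the contiguous block of 32 */
    return (cp & 0xFFFE) == 0xFFFE;  /* low 16 bits are FFFE or FFFF */
}

/* Walks the whole code space and counts the noncharacters. */
static unsigned count_noncharacters(void)
{
    unsigned n = 0;
    for (uint32_t cp = 0; cp <= 0x10FFFF; cp++)
        if (is_noncharacter(cp))
            n++;
    return n;
}
```

Counting over the whole code space gives 32 code points in U+FDD0..U+FDEF plus 2 in each of the 17 planes, 66 in total.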

There has been considerable discussion of this in the past, including:

<http://lists.freedesktop.org/archives/dbus/2010-February/012182.html>

<https://bugs.freedesktop.org/show_bug.cgi?id=40817>

<https://bugzilla.gnome.org/show_bug.cgi?id=107427>

However, Unicode Corrigendum 9 <http://www.unicode.org/versions/corrigendum9.html> clarifies that this was not the intention of the standard, and g_utf8_validate() has been changed <https://bugzilla.gnome.org/show_bug.cgi?id=694669> to consider noncharacters to be valid. This matches the interpretation Thiago advocated in our previous discussions.

We should consider changing the D-Bus Specification, the reference implementation, and any bindings that do their own validity checking (notably dbus-python, at least in git master) to allow non-characters.

As a practical note, GDBus uses g_utf8_validate() to check for validity, so it will happily send messages that dbus-daemon considers to be invalid (and get kicked off the bus as a result).
Comment 1 Simon McVittie 2013-04-03 10:36:41 UTC
Should this change also be made in D-Bus 1.6? Answers on a postcard.

For: if an application using new-GDBus sends a message containing Corrigendum 9 UTF-8, making this change in D-Bus 1.6 means it won't get rejected.

Against: an application expecting a message in "GLib 2.34 UTF-8" could receive an unexpected message in "Corrigendum 9 UTF-8" via a stable-branch dbus-daemon, and crash.

If we're going to make this change at all then my inclination would be to say "yes, also change D-Bus 1.6".
Comment 2 Thiago Macieira 2013-04-03 14:59:58 UTC
"yes, also change D-Bus 1.6"

The number of applications that depend on not receiving non-characters via D-Bus must be vanishingly small.
Comment 3 Simon McVittie 2013-04-22 14:32:45 UTC
Created attachment 78331
[1.6, master] Accept non-characters when validating Unicode

Unicode Corrigendum #9 clarifies that the non-characters U+nFFFE
(for n in the range 0 to 0x10), U+nFFFF (for n in the same range),
and U+FDD0..U+FDEF are valid for interchange, and their presence
does not make a string ill-formed.

GLib 2.36 made the corresponding change in its definition of UTF-8
as used by g_utf8_validate() and similar functions.
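
To make the behavioural difference concrete, here is a minimal sketch of a UTF-8 validator with a switch between the two interpretations. It is not the attached patch and not the libdbus or GLib implementation; utf8_valid, its allow_nonchars parameter and is_noncharacter are hypothetical names, and the sketch ignores D-Bus's other string rules (such as the ban on embedded NUL bytes).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper; the caller has already checked cp <= 0x10FFFF. */
static bool is_noncharacter(uint32_t cp)
{
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

/* Validate len bytes of UTF-8. Overlong forms, surrogates and values
 * above U+10FFFF are always rejected. Noncharacters are rejected only
 * when allow_nonchars is false (the pre-Corrigendum 9 reading). */
static bool utf8_valid(const unsigned char *s, size_t len, bool allow_nonchars)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = s[i++];
        uint32_t cp, min;   /* decoded value, smallest non-overlong value */
        size_t extra;       /* continuation bytes still expected */

        if (b < 0x80)                { cp = b;        extra = 0; min = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; min = 0x10000; }
        else
            return false;   /* stray continuation byte or invalid lead byte */

        if (extra > len - i)
            return false;   /* sequence truncated at end of input */

        for (; extra > 0; extra--) {
            unsigned char c = s[i++];
            if ((c & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF)
            return false;   /* overlong encoding or out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;   /* surrogates are never valid in UTF-8 */
        if (!allow_nonchars && is_noncharacter(cp))
            return false;   /* old interpretation only */
    }
    return true;
}
```

With this sketch, the bytes EF BF BF (U+FFFF) pass when allow_nonchars is true and fail when it is false, which is the kind of sender/daemon mismatch described in the bug description. A surrogate such as ED A0 80 fails either way.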
Comment 4 Simon McVittie 2013-04-22 14:33:32 UTC
Created attachment 78332
[master] Specification: explicitly allow the Unicode noncharacters

This follows Unicode Corrigendum #9.
Comment 5 Simon McVittie 2013-04-22 14:37:18 UTC
Created attachment 78333
[1.6, master] [v2] Accept non-characters when validating Unicode

Unicode Corrigendum #9 clarifies that the non-characters U+nFFFE
(for n in the range 0 to 0x10), U+nFFFF (for n in the same range),
and U+FDD0..U+FDEF are valid for interchange, and their presence
does not make a string ill-formed.

GLib 2.36 made the corresponding change in its definition of UTF-8
as used by g_utf8_validate() and similar functions.

---

v2: also fix the comment above UNICODE_VALID().
Comment 6 Thiago Macieira 2013-04-22 14:56:03 UTC
Comment on attachment 78331
[1.6, master] Accept non-characters when validating Unicode

Review of attachment 78331:
-----------------------------------------------------------------

Ship it!
Comment 7 Thiago Macieira 2013-04-22 14:56:34 UTC
Comment on attachment 78332
[master] Specification: explicitly allow the Unicode noncharacters

Review of attachment 78332:
-----------------------------------------------------------------

Ship it!
Comment 8 Thiago Macieira 2013-04-22 14:57:02 UTC
Comment on attachment 78333
[1.6, master] [v2] Accept non-characters when validating Unicode

Review of attachment 78333:
-----------------------------------------------------------------

Ship it!
Comment 9 Simon McVittie 2013-04-22 15:28:38 UTC
Fixed in git for 1.7.2, 1.6.10.

Any chance you could review Bug #63166, which breaks the build on recent Linux systems, including mine? I think that's the only release blocker at the moment.

