Bug 76112 - "Message did not receive a reply (timeout by message bus)" is misleading
Summary: "Message did not receive a reply (timeout by message bus)" is misleading
Status: RESOLVED FIXED
Alias: None
Product: dbus
Classification: Unclassified
Component: core (show other bugs)
Version: 1.5
Hardware: All All
: medium normal
Assignee: Simon McVittie
QA Contact: D-Bus Maintainers
URL:
Whiteboard: review?
Keywords: patch
Depends on:
Blocks:
 
Reported: 2014-03-13 10:08 UTC by Simon McVittie
Modified: 2014-10-29 16:22 UTC (History)
8 users (show)

See Also:
i915 platform:
i915 features:


Attachments
Use a better NoReply message for disconnection with reply pending (7.23 KB, patch)
2014-03-13 12:08 UTC, Simon McVittie
Details | Splinter Review
dbus-daemon test: don't disconnect before WaitForever call is delivered (2.33 KB, patch)
2014-09-15 17:59 UTC, Simon McVittie
Details | Splinter Review

Description Simon McVittie 2014-03-13 10:08:09 UTC
dbus-daemon has this logic for replies to method calls:

* suppose :1.0 sends a method call to :1.1

* if :1.1 replies to that method call (with success or an error),
  everything's fine

* if :1.1 disconnects without replying, synthesize and send a NoReply
  error

* if reply_timeout > 0 and :1.1 does not reply within reply_timeout
  milliseconds, synthesize and send a NoReply error

Since 2009, the default for reply_timeout is -1 (Bug #16841), so in practice there is no timeout.

However, the message in the NoReply error is still "Message did not receive a reply (timeout by message bus)", which is misleading when it appears within milliseconds of the call (<http://bugs.debian.org/597133>).
Comment 1 Simon McVittie 2014-03-13 12:08:14 UTC
Created attachment 95717 [details] [review]
Use a better NoReply message for disconnection with reply  pending

As an implementation detail, dbus-daemon handles this situation by
artificially triggering a timeout (even if its configured timeout for
method calls is in fact infinite). However, using the same debug message
for both is misleading, and can lead people who are debugging a service
crash to blame dbus-daemon instead, wasting their time.

---

Tested on master, but I'd like to apply this to dbus-1.8 too, since it's low-risk and is beneficial for debugging.
Comment 2 Simon McVittie 2014-04-28 14:05:16 UTC
Reviews/feedback, anyone?

(People who like kdbus: what messages would be produced there, and are they any better than mine?)
Comment 3 Lennart Poettering 2014-04-28 15:21:50 UTC
Patch looks good. In libsystemd-bus we do the same thing: the kernel will report two different notifications to us depending on timeout or peer dying. However, to stay compatible we make a NoReply bus error from that, but do use different messgae strings. 

All good from me.
Comment 4 Simon McVittie 2014-04-28 18:44:58 UTC
I'm getting intermittent test failures, so the actual implementation is probably fine, but the tests might not be. Here's one such failure:

ERROR:/home/smcv/src/fdo/dbus/test/dbus-daemon.c:408:test_no_reply: assertion failed (f->e.name == DBUS_ERROR_NO_REPLY): ("org.freedesktop.DBus.Error.ServiceUnknown" == "org.freedesktop.DBus.Error.NoReply")

so that will need some debugging.
Comment 5 Thiago Macieira 2014-04-28 20:26:48 UTC
Comment on attachment 95717 [details] [review]
Use a better NoReply message for disconnection with reply  pending

Review of attachment 95717 [details] [review]:
-----------------------------------------------------------------

connection.c looks good, I can't comment on the test because I don't understand the test framework.
Comment 6 Alban Crequy 2014-07-29 16:02:54 UTC
(In reply to comment #4)
> I'm getting intermittent test failures, so the actual implementation is
> probably fine, but the tests might not be. Here's one such failure:
> 
> ERROR:/home/smcv/src/fdo/dbus/test/dbus-daemon.c:408:test_no_reply:
> assertion failed (f->e.name == DBUS_ERROR_NO_REPLY):
> ("org.freedesktop.DBus.Error.ServiceUnknown" ==
> "org.freedesktop.DBus.Error.NoReply")
> 
> so that will need some debugging.

How do you run the tests?

The test passes with:
$ cd test ; make check TESTS=test-dbus-daemon

/no-reply/disconnect: OK
/no-reply/timeout: OK
PASS: test-dbus-daemon

But it fails with:
$ cd test ; DBUS_TEST_DATA=data ./test-dbus-daemon

/no-reply/disconnect: **
ERROR:dbus-daemon.c:414:test_no_reply: assertion failed (f->e.message == "Message recipient disconnected from message bus without replying"): ("Message did not receive a reply (timeout by message bus)" == "Message recipient disconnected from message bus without replying")
Aborted

Because it runs /usr/bin/dbus-daemon instead of running dbus-daemon built from the source tree I am testing. So I wonder if your $DBUS_TEST_DAEMON is set correctly.

I can't reproduce the error when DBUS_TEST_DAEMON is set correctly and your patch looks good to me.
Comment 7 Simon McVittie 2014-09-15 17:29:25 UTC
(In reply to comment #4)
> I'm getting intermittent test failures, so the actual implementation is
> probably fine, but the tests might not be. Here's one such failure:
> 
> ERROR:/home/smcv/src/fdo/dbus/test/dbus-daemon.c:408:test_no_reply:
> assertion failed (f->e.name == DBUS_ERROR_NO_REPLY):
> ("org.freedesktop.DBus.Error.ServiceUnknown" ==
> "org.freedesktop.DBus.Error.NoReply")

I think this is a rare race condition. I'm not surprised you couldn't reproduce it.

% for x in `seq 1 1000`; do DBUS_TEST_DATA=$builddir/test/data DBUS_TEST_DAEMON=$builddir/bus/dbus-daemon $builddir/test/test-dbus-daemon -p /no-reply/disconnect || { echo "run $x failed with $?"; break; } ; done
...
ERROR:/home/smcv/src/fdo/dbus/test/dbus-daemon.c:434:test_no_reply: assertion failed (f->e.name == DBUS_ERROR_NO_REPLY): ("org.freedesktop.DBus.Error.ServiceUnknown" == "org.freedesktop.DBus.Error.NoReply")
run 598 failed with 134
Comment 8 Simon McVittie 2014-09-15 17:59:48 UTC
Created attachment 106341 [details] [review]
dbus-daemon test: don't disconnect before WaitForever call is  delivered

---

To be squashed into Attachment #95717 [details].

The race condition was that on rare occasions, dbus-daemon would process the disconnection of the "right" connection before it processed the method call from the "left" connection; then when it got round to processing the method call, it would reply "no such name", because the "right" connection already went away.
Comment 9 Simon McVittie 2014-09-15 18:04:30 UTC
(In reply to comment #7)
> I think this is a rare race condition.

It seems to work reliably with that patch: > 2000 consecutive runs succeeded.
Comment 10 Simon McVittie 2014-10-09 10:49:54 UTC
This has now been here for several weeks with no objections. If nobody reviews it in the next 2 weeks, I intend to go for "easier to ask forgiveness than permission" and commit it to 1.9.
Comment 11 Simon McVittie 2014-10-29 16:22:47 UTC
(In reply to Simon McVittie from comment #10)
> This has now been here for several weeks with no objections. If nobody
> reviews it in the next 2 weeks, I intend to go for "easier to ask
> forgiveness than permission" and commit it to 1.9.

Done. Fixed in git for 1.9.2


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.