Various Mageia users report that the D-Bus update for CVE-2014-3639 causes intermittent slow startup and failure to activate D-Bus services. Resetting the default auth_timeout to what it was before fixing that CVE (30000ms = 30s, whereas currently it is 5000ms = 5s) seems to address that issue.
Debian bug https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=769069 might be the same thing.
It seems that this is more likely to happen on older/slower systems, which would explain why our testing didn't pick this up: the affected Mageia users' machines include AMD Sempron (released 2004-2009), Pentium 4 (2000-2006), Phenom II (2009-2010) and one user reports that "problems appear only on some computers (laptop, old PC, etc.) but not on others (new PC)".
I'm still surprised that D-Bus would ever take as much as 5 seconds to authenticate, but with an aggressively parallel systemd-based boot on a computer several years old, perhaps the worst case is pretty slow.
* complain to syslog when that limit is reached (ideally, when any configurable
limit is reached)
* test whether it can be reverted to 30s without re-introducing
a denial of service risk (CVE-2014-3639)
Related: Bug #80919, Bug #80851.
I tried installing Debian 8 with systemd and the GNOME desktop on an old laptop (Lenovo X60s with 32-bit Centrino Duo 1.66GHz) and I didn't get services starting to fail to connect until I turned the auth_timeout down to 5ms. At 1ms, most services were failing.
Possibly it's the hard disk speed rather than the CPU speed that matters? I used a slightly newer laptop disk for this test.
Created attachment 109674 [details] [review]
Log to syslog when auth_timeout drops an incomplete connection
This is a symptom of either a denial of service attack, or a
serious performance problem. Either way, sysadmins should know.
This does not fix this bug, but it might diagnose it.
(In reply to Simon McVittie from comment #0)
> We should:
> * complain to syslog when that limit is reached (ideally, when any
> limit is reached)
That's the patch I already attached, which I think we should apply to 1.8. I think the other limits should be a separate bug, for 1.9, unless/until we discover that people are hitting them in practice.
> * test whether it can be reverted to 30s without re-introducing
> a denial of service risk (CVE-2014-3639)
On my slower i386, with 100 processes that do socket(), connect(), recv() until kicked off, close() and repeat:
* auth_timeout=5000 (5s): dbus-send takes 2-3 seconds
* auth_timeout=30000 (30s): dbus-send takes 10-20 seconds
So I think, yes, we can go back up to 30s without too much damage.
I'm also starting to think we need some sort of documentation for how to tune dbus-daemon to reduce denial-of-service attacks even at a functionality cost, because it just isn't feasible to prevent every possible DoS altogether; and when we find a DoS that can't be fixed in general, we can update that documentation to describe how to mitigate it.
Sysadmins who know that (a) their system is fast, and (b) their users are actively hostile to each other, can drop auth_timeout back down (or preferably hit their users with the cluebat until behaviour improves).
Created attachment 109677 [details] [review]
Revert "config: change default auth_timeout to 5 seconds"
This reverts commit 54d26df52b6a394bea175651d1d7ad2ab3f87dea.
It appears this change may cause intermittent slow or failed boot,
more commonly on slower/older machines, in at least Mageia and
possibly also Debian. This would indicate that while the system
is under load, system services are not completing authentication
within 5 seconds.
This change was not the main part of fixing CVE-2014-3639, but does
help to mitigate that attack. As such, increasing this timeout makes
the denial of service attack described by CVE-2014-3639 somewhat
more effective: a local user connecting to the system bus repeatedly
from many parallel processes can cause other users' attempts to
connect to take longer.
If your machine boots reliably with the shorter timeout, and
resilience against local denial of service attacks is important
to you, putting this in /etc/dbus-1/system-local.conf
or a file matching /etc/dbus-1/system.d/*.conf can restore
the lower limit:
This is what I propose as a fix for this regression.
Created attachment 109681 [details] [review]
[for 1.9 only] Upgrade auth_timeout violation to a warning, since this branch has those
1.8 doesn't have log levels other than INFO (which is really LOG_NOTICE for some reason), SECURITY and FATAL. 1.9 now has WARNING too.
Testing by Mageia users on https://bugs.mageia.org/show_bug.cgi?id=14249 is quite positive so far: with Attachment #109674 [details] and the equivalent of Attachment #109677 [details], they do seem to be hitting the auth_timeout log message if and only if they configure a small auth_timeout and get the bad symptoms.
After installing Simon’s Debian packages (1.8.10-1+auth1) from his personal repository I hit the issue right away at the next boot and this time I had to hard reset the system as Ctrl + Alt + Del did not help at all.
Of course I still have not enabled the systemd debug shell. :(
The boot is not listed in the output of `journalctl --list-boots` but RSYSLOG came to the rescue.
The following message is shown a lot!
systemd: Looping too fast. Throttling execution a little.
Please find the log attached.
Created attachment 109745 [details]
Copied output from `/var/log/syslog.1` and `/var/log/syslog.2.gz`
By the way, if I hit that bug again and have the systemd debug shell available, what commands am I supposed to run?
> Nov 19 23:53:36 asrock-e350m1 dbus: [system] Connection has not
> authenticated soon enough, closing it (auth_timeout=5000ms,
> elapsed: 73944ms)
... wow. We hoped this would take 5 seconds, or at worst, 30 seconds, but it's taking your system more than 2 minutes for dbus-daemon to wake up after setting a 5-second timer. No wonder you're having trouble.
With timing like that, it seems likely that the failing connection is systemd itself. If systemd can't connect to dbus-daemon, then there's no way system service activation is going to succeed.
For platforms where we can do getpeereid() or equivalent (known to include at least Linux and OpenBSD) to get a peer connection's uid synchronously, I'm considering making connections from uid 0 exempt from the usual timeout. That would mean that root-privileged services like systemd can't possibly fail to connect in this way - but if it takes as long as 2 minutes, systemd might well time it out from the client side anyway, in which case this wouldn't really help.
> Nov 19 23:54:14 asrock-e350m1 systemd: Looping too fast.
> Throttling execution a little.
I wonder whether this indicates that systemd (or possibly some other process) is busy-looping and starving dbus-daemon of CPU time.
> By the way, if I hit that bug again and have the systemd debug shell
> available, what commands am I supposed to run?
Sorry, I don't have any specific ideas other than capturing the Journal.
Using top, iotop or similar to check for processes that are obviously not working as intended (using all the CPU, using all the memory, rapid memory growth, doing a lot of I/O...) would be useful information.
I notice that you have various kernel debug flags enabled, including those for drm and initcalls: I wouldn't be surprised if this was slowing things down for you. Did you enable this in order to debug some other problem? Do you still have that problem? Can you try without the extraneous debugging and see whether that works any better?
I've had successful test results for Attachment #109674 [details] and Attachment #109677 [details], and they're straightforward changes that fix a serious regression, so I am going to apply them soon and do a 1.8.12 release if not vetoed by another maintainer. Reviews welcome.
Review for Attachment #109681 [details] and Bug #86442 would also be appreciated.
Fixed in 1.6.28, 1.8.12, and shortly in 1.9.4.