Summary: | [945GM] display freezes a few minutes after resuming
---|---
Product: | xorg
Reporter: | Martin Pitt <martin.pitt>
Component: | Driver/intel
Assignee: | Jesse Barnes <jbarnes>
Status: | RESOLVED FIXED
QA Contact: | Xorg Project Team <xorg-team>
Severity: | critical
Priority: | high
CC: | brian, clotho67, cwillu, eric, kui.zheng, lool, lubos.kolouch, nalimilan, neitzke, peng.li, tmezzadra, zack.evans
Version: | git
Keywords: | NEEDINFO, regression
Hardware: | x86 (IA32)
OS: | Linux (All)
Description
Martin Pitt
2009-03-07 00:41:29 UTC
Created attachment 23611 [details]
dmesg
dmesg output (nothing interesting after the freeze). This is a clean boot, hibernate, and resume.
Created attachment 23612 [details]
registers after clean boot
Created attachment 23613 [details]
registers after hibernate
Probably not too interesting, since right after hibernate, everything works fine, but for completeness:
$ diff -U0 regs.cleanboot.txt regs.afterhibernate.txt
--- regs.cleanboot.txt 2009-03-06 18:49:36.000000000 +0100
+++ regs.afterhibernate.txt 2009-03-07 08:24:27.000000000 +0100
@@ -34 +34 @@
-(II): LVDS: 0xc0308300 (enabled, pipe B, 18 bit, 1 channel)
+(II): LVDS: 0x40300300 (disabled, pipe B, 18 bit, 1 channel)
@@ -46 +46 @@
-(II): PFIT_CONTROL: 0x00000000
+(II): PFIT_CONTROL: 0x00002668
@@ -166 +166 @@
-(II): pipe B dot 77142 n 2 m1 14 m2 8 p1 2 p2 14
+(II): pipe B dot 108000 n 2 m1 14 m2 8 p1 2 p2 10
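Comparing register dumps by eye gets tedious once there are several of them (clean boot, after hibernate, after freeze). The following is a minimal sketch, not part of the original report, that parses dump lines of the form shown in the diff above and lists the registers whose values differ. The function names (`parse_registers`, `changed_registers`, `changed_bits`) are made up for illustration, and the line format is assumed from the three register lines quoted here.

```python
import re

# Matches lines such as "(II): LVDS: 0xc0308300 (enabled, pipe B, 18 bit, 1 channel)".
# Lines without a hex value (e.g. the "pipe B dot ..." clock lines) are skipped.
REG_LINE = re.compile(r"^\(II\):\s+(\w+):\s+(0x[0-9a-fA-F]+)")


def parse_registers(dump_text):
    """Return {register name: integer value} for every register line in the dump."""
    regs = {}
    for line in dump_text.splitlines():
        match = REG_LINE.match(line.strip())
        if match:
            regs[match.group(1)] = int(match.group(2), 16)
    return regs


def changed_registers(before, after):
    """Return {name: (old, new)} for registers present in both dumps with different values."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


def changed_bits(old, new):
    """Return a mask of the bits that differ between two register values."""
    return old ^ new


if __name__ == "__main__":
    clean = parse_registers(
        "(II): LVDS: 0xc0308300 (enabled, pipe B, 18 bit, 1 channel)\n"
        "(II): PFIT_CONTROL: 0x00000000\n"
    )
    resumed = parse_registers(
        "(II): LVDS: 0x40300300 (disabled, pipe B, 18 bit, 1 channel)\n"
        "(II): PFIT_CONTROL: 0x00002668\n"
    )
    for name, (old, new) in changed_registers(clean, resumed).items():
        print(f"{name}: {old:#010x} -> {new:#010x} (changed bits {changed_bits(old, new):#010x})")
```

The XOR mask only shows which bits moved; what each bit means still has to be looked up in the hardware documentation.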
Created attachment 23614 [details]
registers after screen freeze
Created attachment 23615 [details]
Xorg.0.log
Created attachment 23617 [details]
stracing X after freeze
This is from ssh'ing into the frozen box and attaching strace to X. I see
ioctl(11, 0x6458, 0) =
Then I walked over, wiggled the mouse a bit, and pressed two keys. The strace shows that apparently those events were still received, and it didn't get stuck in a tight infinite loop or something like this. Thus I think that by and large the server still worked.
However, it should be noted that I tried to press "q" to quit the mutt I was working on when the freeze started. Going back to the ssh session mutt was still running, so I don't think that the "q" keypress actually made it all the way through to mutt. So maybe it's not just a screen freeze, but a little harder than that.
Trying to attach gdb wasn't very successful unfortunately. I do have the debug symbols of X.org, libx11, libc6, etc. installed, but still the stack trace is totally useless. Perhaps the "Cannot access memory at address 0xffe85fec" has something to do with it, but I don't know why it's doing that.
$ ps aux|grep X
root 3470 0.0 5.2 115892 53076 tty7 Ss+ Mar06 0:45 /usr/bin/X :0 -br -audit 0 -auth /var/lib/gdm/:0.Xauth -nolisten tcp vt7
martin 6497 0.0 0.0 3348 816 pts/0 S+ 08:39 0:00 grep X
0 martin@tick:~/xdebug $ sudo gdb /usr/bin/X
GNU gdb 6.8-debian
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "i486-linux-gnu"...
(no debugging symbols found)
(gdb) attach 3470
Attaching to program: /usr/bin/X, process 3470
Cannot access memory at address 0xffe85fec
(gdb) bt
#0 0xb7f2b430 in ?? ()
#1 0xb783fee2 in ?? ()
#2 0xb77d60ff in ?? ()
#3 0x0817c0eb in ?? ()
#4 0x08145088 in ?? ()
#5 0x080910c8 in ?? ()
#6 0x081319a4 in ?? ()
#7 0x0808d1ce in ?? ()
#8 0x080721fd in ?? ()
#9 0xb7af5775 in ?? ()
#10 0x080716b1 in ?? ()
(gdb) quit
The program is running. Quit anyway (and detach it)? (y or n) y
Detaching from program: /usr/bin/X, process 3470

XrandR information: (LVDS off, external TFT on, laptop is docked and closed):
$ xrandr
Screen 0: minimum 320 x 200, current 1280 x 1024, maximum 1280 x 1280
VGA disconnected (normal left inverted right x axis y axis)
LVDS connected (normal left inverted right x axis y axis)
   1280x800 59.8 +
   1024x768 85.0 75.0 70.1 60.0
   832x624 74.6
   800x600 85.1 72.2 75.0 60.3 56.2
   640x480 85.0 72.8 75.0 59.9
   720x400 85.0
   640x400 85.1
   640x350 85.1
TMDS-1 connected 1280x1024+0+0 (normal left inverted right x axis y axis) 340mm x 270mm
   1280x1024 75.0*+ 60.0
   1280x960 60.0
   1152x864 75.0
   1024x768 85.0 75.0 70.1 60.0
   832x624 74.6
   800x600 85.1 72.2 75.0 60.3 56.2
   640x480 85.0 75.0 72.8 66.7 59.9
   720x400 70.1
TV disconnected (normal left inverted right x axis y axis)

Created attachment 23618 [details]
lspci -vvnn
that bit just says that the ring is busy -- it's probably just a side effect of the chip being hung.

Finally, my xorg.conf:
$ cat /etc/X11/xorg.conf
Section "Device"
	Identifier "Configured Video Device"
	Option "FramebufferCompression" "off"
EndSection
I need to set this option because of bug 19304.

I confirm that this also happens if I use the laptop undocked, with just the internal LVDS:
$ xrandr
Screen 0: minimum 320 x 200, current 1280 x 800, maximum 1280 x 1280
VGA disconnected (normal left inverted right x axis y axis)
LVDS connected 1280x800+0+0 (normal left inverted right x axis y axis) 261mm x 163mm
   1280x800 59.8*+
   1024x768 85.0 75.0 70.1 60.0
   832x624 74.6
   800x600 85.1 72.2 75.0 60.3 56.2
   640x480 85.0 72.8 75.0 59.9
   720x400 85.0
   640x400 85.1
   640x350 85.1
TMDS-1 disconnected (normal left inverted right x axis y axis)
TV disconnected (normal left inverted right x axis y axis)

I can confirm this issue, it happens in Gentoo and Arch for me... it is very annoying

I have now upgraded to Linux 2.6.28.8 and -intel 2.6.3, and suspend/hibernate now works fine again, no hangs any more. Thus I tentatively close this now. Lubos, if it still happens for you with the latest version, please reopen.

Sorry, just got it again. It seems to happen a lot less often now, but still there.

It *just* happened to me as well. I did several times suspend & resume during the weekend, all OK, but now X stopped responding.
Gentoo kernel 2.6.28-r4, xf86-video-intel-2.6.3-r1

After latest upgrade it happens again 100% of the time... work->hibernate->resume->wait->freeze->reboot
gentoo-sources-2.6.29
xf86-video-intel-2.6.3-r1
mesa-7.3-r1

Confirmed that this still happens with the latest (v 5) patch in bug 18651, so this is apparently not related to pipe underruns.

I wonder if it is not related to http://bugzilla.kernel.org/show_bug.cgi?id=12778

Indeed, I also get this message when it happens:
Mar 29 23:32:54 tick kernel: [14858.069290] [drm:i915_get_vblank_counter] *ERROR* trying to get vblank count for disabled pipe 1
Mar 29 23:32:54 tick kernel: [14858.074255] mtrr: no MTRR for d0000000,10000000 found

I confirm that running X with Option "DRI" "off", and rmmod'ing i915 and drm, suspend works fine. This might indicate that http://bugzilla.kernel.org/show_bug.cgi?id=12778 is indeed the cause of this.

Can you confirm that you're not running 'vbetool post' or with any of the ACPI S3 reposting stuff? That's caused problems for us in the past...

I just gave a thorough testing to the pm-utils scripts and quirks, and confirm that /usr/lib/pm-utils/sleep.d/98smart-kernel-video still does the right thing. I. e. it filters out all quirks for intel on >= 2.6.26 and thus does not run any quirks (and thus no VBE post/S3 stuff).

So far it looks ok on my 945 with the latest Jaunty bits (so 2.6.28-11-generic and xf86-video-intel 2.6.3), but I've only been waiting a few minutes (while moving windows around and browsing the web). Can you reproduce it with the 2.6.3 driver? It has quite a few fixes that might be relevant.

When I tested the suspend quirks, I was running with DRI enabled again (on current Jaunty, i. e. with 2.6.3). It indeed survived for about 10 minutes, then it froze. This also happened to a colleague of mine here at the CELF/LF summit, who also has a 945.

As I said, it is totally erratic. I had it survive for as much as 2 hours, then only for 1 minute, in most of the cases it's like 5 minutes. I couldn't see a pattern when it happens wrt. to the actions performed. In many cases I was just reading something and didn't even move the mouse.

We finally found the reason for this. Our kernel had the patch from http://bugzilla.kernel.org/show_bug.cgi?id=12950 applied, to improve performance for netbooks. This patch was now identified as causing this regression, and we reverted it. Thus I close this bug report now. Lubos, if you want to "take over" this bug, please reopen; perhaps you could check if above patch is in Gentoo as well?

Thanks for the update Martin... It's strange that the MCHBAR patch would cause problems with suspend/resume though. I'll look through the patch again but if you get a chance could you try running with the patch but with tiling disabled in your xorg.conf (option "tiling" "false")?

Martin, it happens to me also with vanilla kernel.

I booted the previous kernel with the MCHBAR patch, disabled tiling, suspended, and it hanged again after about an hour. I have run with the updated kernel (with the MCHBAR patch reverted) all day, and on the conf I'm using suspend/resume a lot. No hang here. However, I haven't looked what that MCHBAR patch was about. I cannot assert whether reverting it really fixed the suspend hang to 100%, or whether it was just sheer luck that it survived a day. Before that, I got the hang pretty reliably within an hour, though.

It seems I just was lucky yesterday, it survived the entire day without freezing. But sure enough, when I kept my laptop suspended over night and resumed this morning, it froze after a couple of minutes. So it was unrelated to the MCHBAR patch after all. Darn! :-/

Thanks for testing Martin, I'll see if I can reproduce locally (again, I guess I'm in for lots of waiting). If you could capture a backtrace via gdb of the hung server that might help a lot.

I think I did already, and it delivered nothing but ??. Also, I don't think it's actually hung, since I can still strace it and see mouse/keyboard activity. But I'll try harder to gdb it once I'm back home next week (with just a single laptop at the conference I don't have a place to ssh into the box).

Oh yeah you did, forgot about that. I'm not sure why gdb wasn't able to attach properly but hopefully you can figure that out and get a useful trace. I usually just su and do it as root rather than using sudo (not sure how that affects uid and effective uid etc).

Created attachment 24962 [details]
GPU dump with 2.6.30rc2
I tried to reproduce this with linux 2.6.30RC2 and libdrm 2.4.9, so that I could use intel_gpu_dump (standard Jaunty, where I encountered the hang before, has 2.6.28.8 and libdrm 2.4.5). However, the symptoms are now slightly different, so I'm not sure whether this is useful at all:
- I get hangs without any special VT switches/suspend/etc after a few hours.
- After suspend, the first hang again occurs after a few minutes
- Unlike with standard jaunty, I can recover from the hang with a VT switch, but then it again happens after a few minutes. GPU dump attached (compressed, sorry, raw file was too big for bugzilla)
- This also happens without compositing (whereas disabling compiz was a good workaround for the original bug here).
For each hang that happens, I get
[ 204.095061] [drm:i915_get_vblank_counter] *ERROR* trying to get vblank count for disabled pipe 1
in dmesg.
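Since the same error line accompanies every hang, its timestamps could be used to see how long after a resume each hang occurred. The sketch below is an illustration only, assuming the dmesg line format quoted above; `find_vblank_errors` is a hypothetical helper name, not an existing tool.

```python
import re

# Pattern for the error quoted above, e.g.
# "[ 204.095061] [drm:i915_get_vblank_counter] *ERROR* trying to get vblank count for disabled pipe 1"
VBLANK_ERROR = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+\[drm:i915_get_vblank_counter\] \*ERROR\* "
    r"trying to get vblank count for disabled pipe (\d+)"
)


def find_vblank_errors(dmesg_text):
    """Return a list of (seconds since boot, pipe number) for each matching error line."""
    return [
        (float(match.group(1)), int(match.group(2)))
        for match in VBLANK_ERROR.finditer(dmesg_text)
    ]


if __name__ == "__main__":
    sample = (
        "[ 204.095061] [drm:i915_get_vblank_counter] *ERROR* "
        "trying to get vblank count for disabled pipe 1\n"
    )
    for timestamp, pipe in find_vblank_errors(sample):
        print(f"hang candidate at {timestamp:.3f}s on pipe {pipe}")
```

The pattern also matches syslog-prefixed lines, because it searches for the bracketed kernel timestamp anywhere in the line.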
I attached similar dump of frozen GPU to #20560 ... seems like we are tracing the same issue in two bugs...

Lubos, thanks. However, please note that the GPU dump is for the hangs which happen on 2.6.30RC2, which behave very different to the hangs I get on 2.6.28.8. I just can't use intel_gpu_dump on the latter, so this was my (vain) attempt to provide info for the original hang.

Martin, my dump is also from 2.6.30RC2 and it behaves exactly the same for me as in 2.6.28 and 2.6.29 ! I can't get away from it just by changing the VT!

For the record, I now updated to linux 2.6.30rc3, -intel 2.7.0, libdrm 2.4.9, and turned on UXA. Things are running smoothly now, and I suspended about 5 times during the afternoon/evening without any problem.

Ok, marking fixed. Thanks Martin.

As mentioned in #20560 , this is far from fixed...

Created attachment 26643 [details]
Dump with 2.6.30-rc8-git6
Created attachment 26783 [details]
KMS/composite freeze logs from Martin Pitt
It had worked fine for some weeks (KMS+compiz) on my i945, but now it's back. I'm following Ubuntu's "xorg-edgers" archive which has very current snapshots of upstream. Unlike most regressions that I see, this one isn't just a temporary glitch, it's been broken for over a week now. It now freezes about two seconds after resuming, not several minutes, but otherwise the symptoms are very similar. Should I open a new bug about this, or is it the same? Logs attached (dmesg, gpu, registers, Xorg.log). My current versions:
Linux 2.6.30 final, with git pull from anholt/drm-intel.git (commit 03d606991)
libdrm from 2009-06-06 (3d4bfe8c)
mesa from 2009-06-13 (18af7c38)
intel from 2009-06-11 (6d062e9e)
I tried the following combinations:
- KMS, X.org session with compiz: usually freezes; rarely it survives the first suspend, then freezes on the second
- no KMS, X.org session with compiz: ok
- KMS, VT only: ok
- KMS, gdm only (no composite): ok
- KMS, X.org session with metacity (no composite): ok
- KMS, X.org with compiz, switch to VT1 before suspend: ok on resume, often freezes as soon as switching back to X.org
We tested this bug on 945GM with master branch, display will freeze right after system wake from S4 if we are running gnome with or without compiz. If we run raw X, most of time the system could wake from S4 correctly, but one time, it crashed the whole system. S3 works fine.

*** Bug 22039 has been marked as a duplicate of this bug. ***

Ug, ok sounds like there are real issues with KMS resume. Let's keep S3 and S4 separate though; can someone seeing an issue with hibernate file a separate bug?

*** Bug 22010 has been marked as a duplicate of this bug. ***

Created attachment 26881 [details]
script to do s3 automatically
This is a script to do S3 suspend/resume cycles automatically; it should help to reproduce this issue.
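The attached script itself is not reproduced in this report. As a rough idea of what such a loop can look like, here is a minimal sketch that assumes the `rtcwake` tool from util-linux is installed and that the script runs as root; the function names and the default timings are illustrative, and a later comment in this bug suggests that 10 seconds was too short and 15 seconds worked better.

```python
import subprocess
import time


def rtcwake_command(sleep_seconds):
    """Build the rtcwake call: -m mem suspends to RAM (S3), -s sets the wake alarm in seconds."""
    return ["rtcwake", "-m", "mem", "-s", str(sleep_seconds)]


def run_cycles(cycles, sleep_seconds=15, awake_seconds=15, dry_run=True):
    """Run (or, with dry_run=True, only list) a number of suspend/resume cycles."""
    commands = []
    for _ in range(cycles):
        command = rtcwake_command(sleep_seconds)
        commands.append(command)
        if not dry_run:
            # Blocks until the machine has resumed; raises if rtcwake fails.
            subprocess.run(command, check=True)
            # Give the display time to show (or not show) the freeze before the next cycle.
            time.sleep(awake_seconds)
    return commands


if __name__ == "__main__":
    # Dry run by default so that importing or testing this file never suspends the machine.
    for command in run_cycles(3):
        print(" ".join(command))
```

Calling `run_cycles(20, dry_run=False)` as root would actually suspend the machine 20 times, so the dry-run default is deliberate.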
I met the same problem in moblin, after 3 times S3 resume, screen become blank. I got the regdump diff of good and bad s3 resume, same as above
-(II): MI_MODE: 0x00000200
+(II): MI_MODE: 0x00000000

(In reply to comment #44)
> Ug, ok sounds like there are real issues with KMS resume. Let's keep S3 and S4
> separate though; can someone seeing an issue with hibernate file a separate
> bug?
>
Is bug#22263 the hibernation bug?

(In reply to comment #46)
> Created an attachment (id=26881) [details]
> script to do s3 automatically
>
> This is a script to do S3 resume automatically, should be help to reproduce
> this issue
>
Maybe 10 sec is not enough. I change the sleep and wake up time to 15sec, and test 20 times suspend/resume, it works well.

Gordon: bug#22263 is not the hibernation problem I'm seeing, and doesn't seem to be Martin's either (comment #41). I don't get any screen corruption. See bug 22366.

(In reply to comment #45)
> *** Bug 22010 has been marked as a duplicate of this bug. ***
>
I'm not sure whether this is a duplicate of this bug. I have done some tests. I'm sure kernel 2.6.29.4 and 2.6.30-rc5 is good. The screen corruption and X hang only occur on kernel after 2.6.30-rc6. I'll try do some bisect to see which commit is suspicious.

Well, git bisect shows that revert commit:
79f11c19a396e8cea7dad322dcfb46c0a8517fe6
drm/i915: save/restore fence registers across suspend/resume
make kernel 2.6.30 resume works again. kernel 2.6.30-rc5 + the above commit doesn't cause this hang, so it could be some conflict between this commit and other commits for kernel 2.6.30-rc6.

Here is some addition info.
i915_gem_fence_regs before suspend:
Reserved fences = 3
Total fences = 16
Fenced object[ 0] = unused
Fenced object[ 1] = unused
Fenced object[ 2] = unused
Fenced object[ 3] = f676c360: P 00c00000 00400000 00001000 X 00000002 00000002 0 (name: 1)
Fenced object[ 4] = f6901f00: 02000000 00400000 00001000 X 00000002 00000002 0 (name: 2)
Fenced object[ 5] = f6901f60: 02400000 00400000 00001000 X 00000002 00000002 0 (name: 3)
Fenced object[ 6] = unused
Fenced object[ 7] = unused
Fenced object[ 8] = unused
Fenced object[ 9] = unused
Fenced object[10] = unused
Fenced object[11] = unused
Fenced object[12] = unused
Fenced object[13] = unused
Fenced object[14] = unused
Fenced object[15] = unused
i915_gem_fence_regs after resume:
Reserved fences = 3
Total fences = 16
Fenced object[ 0] = unused
Fenced object[ 1] = unused
Fenced object[ 2] = unused
Fenced object[ 3] = f6042780: P 00c00000 00400000 00001000 X 00000002 00000000 0 (name: 1)
Fenced object[ 4] = unused
Fenced object[ 5] = unused
Fenced object[ 6] = unused
Fenced object[ 7] = unused
Fenced object[ 8] = unused
Fenced object[ 9] = unused
Fenced object[10] = unused
Fenced object[11] = unused
Fenced object[12] = unused
Fenced object[13] = unused
Fenced object[14] = unused
Fenced object[15] = unused

(In reply to comment #52)
Sorry, this is the one after resume.
i915_gem_fence_regs after resume:
Reserved fences = 3
Total fences = 16
Fenced object[ 0] = unused
Fenced object[ 1] = unused
Fenced object[ 2] = unused
Fenced object[ 3] = f676c360: P 00c00000 00400000 00001000 X 00000002 00000000 0 (name: 1)
Fenced object[ 4] = unused
Fenced object[ 5] = unused
Fenced object[ 6] = unused
Fenced object[ 7] = unused
Fenced object[ 8] = unused
Fenced object[ 9] = unused
Fenced object[10] = unused
Fenced object[11] = unused
Fenced object[12] = unused
Fenced object[13] = unused
Fenced object[14] = unused
Fenced object[15] = unused

If fence register save/restore really is the issue, this patch should help.
Current code saves the fence registers before rendering has completed, which can affect fence register allocation. If we save before rendering completes, and restore again at resume time, we may end up causing trouble with whatever objects land in the fenced space after resume. Saving register state (including fences) *after* we've idled the memory manager should help with that.
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 98560e1..e3cb402 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -67,8 +67,6 @@ static int i915_suspend(struct drm_device *dev, pm_message_t s
 	pci_save_state(dev->pdev);
 
-	i915_save_state(dev);
-
 	/* If KMS is active, we do the leavevt stuff here */
 	if (drm_core_check_feature(dev, DRIVER_MODESET)) {
 		if (i915_gem_idle(dev))
@@ -77,6 +75,8 @@ static int i915_suspend(struct drm_device *dev, pm_message_t s
 		drm_irq_uninstall(dev);
 	}
 
+	i915_save_state(dev);
+
 	intel_opregion_free(dev, 1);
 
 	if (state.event == PM_EVENT_SUSPEND) {

(In reply to comment #54)
> If fence register save/restore really is the issue, this patch should help.
>
Yes, it does help my problem. The system can resume correctly again. I didn't see a hang so far.

I tested the patch in comment 54 and also confirm that it fixes suspend/resume with the internal laptop monitor. Thanks! It still fails with the external one, but that's a different problem, and I'm going to report it separately.

(In reply to comment #54)
> If fence register save/restore really is the issue, this patch should help.
applied the patch here and it appears to have fixed it for me.. intel gma950 laptop.

Great, thanks for testing. Fix has been pushed into the kernel:
commit 9e06dd39f2b6d7e35981e0d7aded618686b32ccb
drm/i915: correct suspend/resume ordering

(In reply to comment #58)
> Great, thanks for testing. Fix has been pushed into the kernel:
>
> commit 9e06dd39f2b6d7e35981e0d7aded618686b32ccb
> drm/i915: correct suspend/resume ordering
The fix is in drm-intel-next branch. Eric, please cherry-pick it into qa-branch so it'll be in Q2 package.

(In reply to comment #58)
> Great, thanks for testing. Fix has been pushed into the kernel:
>
> commit 9e06dd39f2b6d7e35981e0d7aded618686b32ccb
> drm/i915: correct suspend/resume ordering
>
Maybe this fix should also be send to 2.6.30.x stable branch, since it's a regression during the 2.6.30 rc process. And it will make user of the stable kernel happy. Thanks.

On Tue, 23 Jun 2009 20:16:32 -0700 (PDT)
> --- Comment #60 from Jie Luo <clotho67@gmail.com> 2009-06-23
> 20:16:32 PST --- (In reply to comment #58)
> > Great, thanks for testing. Fix has been pushed into the kernel:
> >
> > commit 9e06dd39f2b6d7e35981e0d7aded618686b32ccb
> > drm/i915: correct suspend/resume ordering
> >
>
> Maybe this fix should also be send to 2.6.30.x stable branch, since
> it's a regression during the 2.6.30 rc process. And it will make user
> of the stable kernel happy. Thanks.
Good point, want to send a note to stable@kernel.org with the commit info, proposing the patch for inclusion?
Thanks,
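For readers who want to see why the order of "save state" and "idle" matters, the reasoning in comment 54 can be illustrated with a toy model. This is only a sketch under strong assumptions and not the driver's actual logic: `ToyDevice`, its methods, and the rule that idling releases fences held by in-flight rendering are all invented for illustration, loosely mirroring the fence dump in comment 52 (slots 3, 4 and 5 in use before suspend, only slot 3 afterwards).

```python
class ToyDevice:
    """Toy stand-in for the GPU that only tracks which fence slots are in use."""

    def __init__(self):
        # Hypothetical starting point: one long-lived object and two objects
        # that are fenced only while rendering is still in flight.
        self.fences = {3: "scanout", 4: "in-flight A", 5: "in-flight B"}
        self.saved = None

    def idle(self):
        # Assumption: once rendering completes, fences that were only needed
        # by in-flight work are released by the memory manager.
        self.fences = {
            slot: obj
            for slot, obj in self.fences.items()
            if not obj.startswith("in-flight")
        }

    def save_state(self):
        # Snapshot of the fence registers as they are at this moment.
        self.saved = dict(self.fences)


def stale_fences_after_resume(save_before_idle):
    """Return fence slots that would be restored although nothing tracks them any more."""
    dev = ToyDevice()
    if save_before_idle:
        dev.save_state()  # old ordering: snapshot taken while rendering is in flight
        dev.idle()
    else:
        dev.idle()        # patched ordering: idle first, then snapshot
        dev.save_state()
    tracked = set(dev.fences)   # what the memory manager believes is fenced
    restored = set(dev.saved)   # what resume would write back to the hardware
    return sorted(restored - tracked)


if __name__ == "__main__":
    print("save before idle:", stale_fences_after_resume(True))
    print("save after idle: ", stale_fences_after_resume(False))
```

In this model the old ordering restores fence slots that the memory manager no longer accounts for, which is the kind of mismatch comment 54 describes; with the patched ordering the snapshot and the tracked state agree.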