Description
Ronald
2012-05-19 09:33:58 UTC
Created attachment 61852 [details]
Dmesg output of correct resume
On a small side note, compiling current head with !CONFIG_DRM_NOUVEAU_BACKLIGHT results in: http://pastebin.com/8TvZNDqz

Current head means:
1. git pull git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
2. git pull git://anongit.freedesktop.org/nouveau/linux-2.6

Through googling I found out that drm has its own debug facility (duh...). I have added 4 files that were generated with the following kernel boot parameters:

drm.debug=15 log_buf_len=32M

The 'bad' state contains the boot log and suspend log of the kernel at commit 5d720f2450 from the nouveau/linux-2.6 tree.
The 'good' state contains the boot log and suspend log of the kernel at the commit before 5d720f2450 from the nouveau/linux-2.6 tree. (Which is the default kernel for now.)

I'm out of ideas right now, so any pointers would be great.

Created attachment 61858 [details]
Dmesg boot log of the 'bad' kernel
Created attachment 61859 [details]
Dmesg resume log of the 'bad' kernel
Created attachment 61860 [details]
Dmesg boot log of the 'good' kernel
Created attachment 61861 [details]
Dmesg resume log of the 'good' kernel
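Before collecting logs like the ones attached above, it can help to confirm that the boot parameters from the description (drm.debug=15 log_buf_len=32M) actually reached the kernel. A minimal sketch, assuming a POSIX shell with tr and sed; the helper name cmdline_value is made up for illustration and is not part of any existing tool:

```shell
#!/bin/sh
# Hypothetical helper: print the value of one kernel boot parameter.
# $1 = parameter name (used as a sed pattern, so a "." matches any character)
# $2 = file holding the command line; defaults to /proc/cmdline,
#      which is the command line of the running kernel.
cmdline_value() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | sed -n "s/^$1=//p"
}

# Example, after booting with the parameters above:
#   cmdline_value drm.debug
#   cmdline_value log_buf_len
#   dmesg > boot.log     # save the boot log (file name is arbitrary)
```

If a parameter is missing, the helper prints nothing, which is a cheap way to catch a typo in the boot loader entry before spending time on a suspend/resume test.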
I tried kernel 3.5-rc1, because it seems that some recent patches were left out during the merge. I was hoping that this newer kernel would somehow solve this. However, I'm still having the same issues in this release.

On a further note, the commit before 5d720f2450 was a pretty good release for this card. Glxgears reported almost 400fps and window resizing was smooth. With v3.4.1 or v3.5-rc1, this performance regressed to its original state. Glxgears reports 100fps and resizing is choppy again.

Now that I have found out the root cause of my resume trouble (documented in 53101) I found this bug. Here's the info so far (replicating from bug 53101):

ThinkPad W520 4276CTO NVC3 (2000M)
openSUSE 12.2 + nouveau 20120813 872dcac

* Booting works (nox2apic, W520 ACPI table issue)
* gdm has graphics distortions though (see early dmesg excerpt)
* double ctrl+alt+backspace "fixes" this and gdm looks good
* suspend from gnome-shell 3.4.2 works
* resume shows gdm-password prompt and usually a white-noise background
** the gnome-shellish top-panel looks intact, though
** mouse cursor not movable, cpu load
** looks like "something" tries to restart gdm/X over and over again
* switching to vt possible with some insisting
* restarting gdm does lock up the system
* the "channel x kick timeout" seems new since some commits IIRC

repeatedly in dmesg:
[ 156.925301] nouveau E[ PFIFO][0000:01:00.0] playlist update failed
[ 159.924800] nouveau E[ DRM][0000:01:00.0] failed to idle channel 0xcccc0000
[ 161.924690] nouveau E[ PFIFO][0000:01:00.0] channel 1 kick timeout
[ 161.924787] nouveau [ PFIFO][0000:01:00.0] unknown status 0x00000100
[ 163.924603] nouveau E[ PFIFO][0000:01:00.0] playlist update failed
[ 163.989722] nouveau [ PFIFO][0000:01:00.0] unknown status 0x00000100
[ 165.989535] nouveau E[ PFIFO][0000:01:00.0] channel 3 kick timeout
[ 165.989670] nouveau [ PFIFO][0000:01:00.0] unknown status 0x00000100
[ 167.989455] nouveau E[ PFIFO][0000:01:00.0] playlist update failed
[ 167.989517] nouveau ![ PFIFO][0000:01:00.0] unhandled status 0x00000001
[ 170.649537] nouveau E[ PFIFO][0000:01:00.0] playlist update failed
[ 172.660200] nouveau E[ PFIFO][0000:01:00.0] playlist update failed
[ 185.103713] nouveau E[ DRM][0000:01:00.0] failed to idle channel 0xcccc0001
[ 187.103627] nouveau E[ PFIFO][0000:01:00.0] channel 2 kick timeout

I tried a fc17 install and the original kernel (3.3.4-5.fc17.x86_64) worked. Suspend/resume is fine, at least when not in the docking station. After updating that test install to 3.5.1-1.fc17.x86_64, the same issues I see in openSUSE 12.2 cropped up. So this looks distribution agnostic.

--

Bisection rounds testing successful suspend/resume cycles on NVC3/2000M:

note:
* gdm greeter is showing garbage (screen content from before reboot) somewhere before the last known good commits
** this issue was ignored and is still present in the last good commit but is not the topic of this bug

$ git bisect log
# bad: [f9b495fca46836a6a05cedde8058ccb8a3e62c3d] drm/nouveau: use ioread32_native/iowrite32_native for fifo control registers
# good: [f887c425f9eeed8ffbca64c8be45da62b07096c0] drm/nouveau: bump version to 1.0.0
git bisect start 'HEAD' 'f887c425f9eeed8ffbca64c8be45da62b07096c0' '--' 'drivers/gpu/drm/nouveau/'
# bad: [9bd0c15fcfb42f6245447c53347d65ad9e72080b] drm/nouveau/fbcon: using nv_two_heads is not a good idea
git bisect bad 9bd0c15fcfb42f6245447c53347d65ad9e72080b
# good: [5132f37700210740117f5163b5df7aa1c8469a55] drm/nve0/fifo: initial implementation
git bisect good 5132f37700210740117f5163b5df7aa1c8469a55
# bad: [71af5e62db5d7d6348e838d0f79533653e2f8cfe] drm/nv50/gr: make sure NEXT_TO_CURRENT is executed even if nothing done
git bisect bad 71af5e62db5d7d6348e838d0f79533653e2f8cfe
# good: [afada5e0bb3cac8530c2ae36aa0abca41d60e063] drm/nv04/disp: disable vblank interrupts when disabling display
git bisect good afada5e0bb3cac8530c2ae36aa0abca41d60e063
# bad: [5e120f6e4b3f35b741c5445dfc755f50128c3c44] drm/nouveau/fence: convert to exec engine, and improve channel sync
git bisect bad 5e120f6e4b3f35b741c5445dfc755f50128c3c44
# good: [35bcf5d55540e47091a67e5962f12b88d51d7131] drm/nouveau: move flip-related channel setup to software engine
git bisect good 35bcf5d55540e47091a67e5962f12b88d51d7131
# good: [d375e7d56dffa564a6c337d2ed3217fb94826100] drm/nouveau/fence: minor api changes for an upcoming rework
git bisect good d375e7d56dffa564a6c337d2ed3217fb94826100

5e120f6e4b3f35b741c5445dfc755f50128c3c44 is the first bad commit
commit 5e120f6e4b3f35b741c5445dfc755f50128c3c44
Author: Ben Skeggs <bskeggs@redhat.com>
Date: Mon Apr 30 13:55:29 2012 +1000

    drm/nouveau/fence: convert to exec engine, and improve channel sync

    Now have a somewhat simpler semaphore sync implementation for nv17:nv84, and a switched to using semaphores as fences on nv84+ and making use of the hardware's >= acquire operation.

    Signed-off-by: Ben Skeggs <bskeggs@redhat.com>

:040000 040000 8f2ca4ddf4969c75f688a96fdb152e449fda4852 da67a1bd8d608577e659a26715cf8af3644d8efe M drivers

--

@Ronald
* can it be that the ticket subject carries the wrong commitish? (because the bad commit we both identified is "5e120f6e4b3f35b741c5445dfc755f50128c3c44" in the nouveau/linux-2.6 tree)
* if you like, extend the subject with NVC3/Quadro 2000M

Created attachment 65931 [details]
W520-4276CTO-NVC3 dmesg commitish-872dcac gdm + suspend/resume cycle
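The bisection above restricts git bisect to commits that touch drivers/gpu/drm/nouveau/, which keeps the number of kernels to build and test small. A sketch of that workflow, assuming it is run from the top of a kernel checkout; the function name is hypothetical, and the build, boot and suspend/resume test between the steps still has to be done by hand:

```shell
#!/bin/sh
# Hypothetical wrapper around the "git bisect start" line in the log above.
# $1 = known-bad revision, $2 = known-good revision.
# Only commits touching the nouveau driver directory are considered.
nouveau_bisect_start() {
    git bisect start "$1" "$2" -- drivers/gpu/drm/nouveau/
}

# Typical session:
#   nouveau_bisect_start HEAD f887c425f9eeed8ffbca64c8be45da62b07096c0
#   ...build, boot, run a few suspend/resume cycles...
#   git bisect good      # or: git bisect bad
#   git bisect log       # record of the steps taken so far
#   git bisect reset     # return to the original checkout when done
```

Since this bug tends to show up only on the second resume, it is probably worth running more than one cycle before marking a commit as good.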
Thanks for your info and a 'me2'. I updated the title as you requested. I placed the entire commit title. Every time a new kernel is released, the nouveau project does a 'git rebase', which essentially wipes all patches outside of Linus' tree and places them back one by one. This implies a change in the SHA hash, since git treats those patches that are being reapplied (but are essentially the same) as new (and thus different) commits.

On a side note (just to summarize/confirm this bug), it seems that:
- symptoms are the exact same
- regression fails
- cursor invisible
- screen is corrupted
- no visual response
- vt switching works with some persistence
- logs are flooded (with generic messages)
- bisected patch is the exact same (as in title)
- however, as of now, based on different kernel versions (v3.4 vs v3.5)

However, there is one nitpick. I have pinned this computer's kernel on 3.4 since 3.5-rc0 and higher suffer from another regression which I have failed to bisect (I forgot why/how that did not work). When I have time this weekend, I will see how far I can get with Linus' tree and then combine with the nouveau tree and we will see what happens. A refresh of the test data might be good, since the patches from the nouveau tree allow setting debug levels, which is nice.

Ah, thanks for clearing that up with the rebased commits. Didn't know they'd get a new hash then. Apart from providing more verbose logs, I am unsure what else one can do here. I think a dev (Ben?) should point us in some direction. The bisected bad commit is a tad too huge for me to understand/revert/trial-and-error-tinker with ;)

*** Bug 55744 has been marked as a duplicate of this bug. ***

Same issue here with my EVGA GeForce 450 GTS (1GB). It works fine with the nVidia drivers.
Kernel: 3.6.1
nouveau-git & libdrm-git
OS: Archlinux

Francois,
Can you confirm that your regression is also caused by the following commit?
drm/nouveau/fence: convert to exec engine, and improve channel sync
Michael's and my problems are not there when we build a kernel before this patch is applied. Please test to make sure that you are not having a separate problem. Thank you.

Still not working with 3.7-rc4:
nouveau [ DRM] re-enabling device...
nouveau [ DRM] resuming client object trees...
nouveau [ VBIOS][0000:01:00.0] running init tables
nouveau W[ PTIMER][0000:01:00.0] unknown input clock freq
agpgart-via 0000:00:00.0: AGP 3.5 bridge
agpgart: kworker/u:30 tried to set rate=x12. Setting to AGP3 x8 mode.
agpgart-via 0000:00:00.0: putting AGP V3 device into 8x mode
nouveau 0000:01:00.0: putting AGP V3 device into 8x mode
nouveau [ DRM] Loading NV17 power sequencing microcode
nouveau [ DRM] resuming display...
nouveau [ DRM] Setting dpms mode 3 on TV encoder (output 1)
Restarting tasks ... done.
nouveau E[ DRM] reloc wait_idle failed: -16
nouveau E[ DRM] reloc apply: -16
nouveau E[ DRM] reloc wait_idle failed: -16
nouveau E[ DRM] reloc apply: -16
nouveau E[ DRM] reloc wait_idle failed: -16
nouveau E[ DRM] reloc apply: -16
Screen stays black on resume.

I tested this +nouveau tree which enables z-compression.

Linux kernel 3.7-rc4 + nouveau tree (see nouveau.txt for patches):
CONFIG_LOG_BUF_SHIFT=21
CONFIG_NOUVEAU_DEBUG=6
CONFIG_NOUVEAU_DEBUG_DEFAULT=6 # 7 gave a >16MB log

Did a 'dmesg -c' for each separate logfile just before doing 'pm-hibernate'. Logfiles are from the same boot. First resume succeeds, second hangs just like the first occurrence of this bug.

Attaching files...

Created attachment 69606 [details]
Dmesg of boot
Created attachment 69607 [details]
Dmesg of succesful first resume
Created attachment 69608 [details]
Dmesg of failed second resume
Created attachment 69609 [details]
v3.7-rc4 + nouveau patchlist
I'm also attaching a diff of both logs. I filtered them using sed to get rid of the timestamps.

One thing stands out (to me). The logs of the correct resume mention this:
-ACPI: Invalid Power Resource to register!
-agpgart-via 0000:00:00.0: Refused to change power state, currently in D0

The rest of the diff contains errors from the bad resume and subsequent trace thereof.

Created attachment 69612 [details]
Diff of both resume sessions
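The timestamp filtering mentioned above can be sketched as follows. This is an assumption about the approach, not the exact sed expression that was used for the attached diff; the function name and the file names are placeholders:

```shell
#!/bin/sh
# Remove the leading "[  123.456789] " timestamp that dmesg puts on each
# line, so that two logs can be diffed without every line differing.
# Reads the named files, or standard input when no file is given.
strip_ts() {
    sed -E 's/^\[ *[0-9]+\.[0-9]+\] ?//' "$@"
}

# Example:
#   strip_ts resume-good.log > good.txt
#   strip_ts resume-bad.log  > bad.txt
#   diff -u good.txt bad.txt
```

Only a bracketed number at the very start of a line is removed, so tags such as "E[ PFIFO]" later in the line are left alone.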
Created attachment 69616 [details] [review]
Reset agp.stat to unknown during suspend

Something to test :P

No dice =) It even hangs at suspend sometimes. Besides, this bug is not slot specific. Someone else with an 'NVC3/Quadro 2000M' (PCI-E) also has these problems since the mentioned commit: 'drm/nouveau/fence: convert to exec engine, and improve channel sync'.

Just tested v3.8-rc1 + nouveau @ 'drm/nouveau/fan: handle the cases where we are outside of the linear zone' on this FX5200. Especially 'drm/nouveau/bios: cache ramcfg strap on later chipsets' looked promising (even though I don't have a recent chipset... there was hope...).

Things did not regress further. First suspend works most of the time, second suspend crashes. Symptoms did change a bit. When resume eventually completes, I do not see a frozen/unresponsive desktop, but the screen goes black with red/green (I don't think I saw any blue) pixels scattered all over it. Density was really low, so maybe 200 pixels on the entire screen? These pixels did not move or change or anything, it was pretty static and boring actually.

But I recommend that Michael and francoism test latest HEAD, as the mentioned commit might fix your issues.

I will add a dmesg from the v3.8rc1 kernel for your viewing convenience. This is a dmesg from syslog, not from the 'dmesg' command!

Created attachment 72060 [details]
Dmesg of v3.8rc1 + nouveau HEAD
referencing bug #55258 here

Kewl, that means that this regression is spread across several cards if you purely look at the symptoms.
- NV40 generation card (0x049200a2)
- twice on an NV34 [GeForce FX 5200] (IRC: FX5200 is an 'odd beast' :/)
- NVC3 (2000M)
- EVGA GeForce 450 GTS

For two of these, the regression has been confirmed to result from commit 'drm/nouveau/fence: convert to exec engine, and improve channel sync', as part of the nouveau rework that got upstreamed in kernel 3.5:
- NV34 [GeForce FX 5200] by me
- NVC3 (2000M) by Michael Weirauch

I think it's safe to assume that people who are experiencing this regression on kernel 3.5 or above are suffering from the same issue. Francoism is using 3.6, so that is okay.

@Raphaël Droz: What kernel version are you using? And maybe the kernel version from Mr-4 might be useful as well. Just to confirm.

Created attachment 72085 [details] [review]
fix

does this patch fix it?
nvc3 should be already fixed in 3.7

Seems to have fixed the bug. It has survived three consecutive suspend and resume cycles. This was not possible after v3.5 without your supplied fix. So it seems that the direct symptoms have been resolved with this patch. But if you don't mind, I would like to test it a bit further over the coming days. And I suggest that everyone else with this hardware (Raphael and Mr-4) tests as well.

Guess I won't be telling my parents they are running an rc1 kernel hehe...

The FX5200 still functions and survived another cycle. I went ahead and tested your patch on a 'G73 (NV4B)' as well. Since modeset became mandatory, suspend/resume never worked (anymore). It suspended, but resumed with a garbled console where some colors were off. Eventually, the card locks up. The recent rework made the screen completely black, so the regression regressed even further. Current head with your patch restores the second regression back to where the display is garbled after resume (after which it locks up again).
So I'm fairly confident that your patch restores pre-rework (v3.5) behaviour on these cards with respect to suspend and resume cycles. I'm going to test an NV4E laptop later, when I have time.

Just tested v3.8rc1+nouveau+patch on an NV4E. Suspend and resume works! Though it takes about 1 minute before it shuts down (hangs somewhere). And maybe 1 or 2 minutes to resume (before the screen is restored). Restarting is faster, but this looks promising. At least it works. Anyways, no regressions, just improvements. Thanks Santa!

Good. Please open new bug reports for the other issues you have with these (4B & 4E) boxes.

Created attachment 72113 [details] [review]
fix v2

I simplified the fix a bit. Can you test it too? (Just one box would be enough)
(Note that you don't need to run it on 3.8-rc1 - it's applicable to 3.7 too)

Reverted the old patch and applied the new one. Did this on the same v3.8rc1+nouveau kernel tree. It only rebuilt the nouveau specific files. The FX5200 survived 4 consecutive cycles with this patch. It seems to exhibit the same behaviour as the previous patch: during suspend and resume, for a short moment, there is a blue/green stray pixel in the top left corner. Everything works fine though. Good to go if you ask me. Maybe apply it to 3.5 and 3.6 as well??

As for the other cards I mentioned (NV4E + NV4B), I merely referenced them to test the patch against regressions. Suspend is broken on NV4B since KMS became mandatory. And it seems to work somewhat on NV4E. *My primary aim is to guard against regressions.*

FWIW, do you really want separate bugs for the NV4E and NV4B??? They are just bugs, not regressions.

This patch is not directly applicable to kernels < 3.7, although porting it is trivial.

WRT other bugs: Yes, I would like to see separate bug reports. Suspend/resume should work for all cards and not take too much time. I don't promise fixing them, of course.

Ok, I will file separate bugs for these cards.
FYI, the FX5200 survived 11 cycles so far without a hitch. Thanks a lot =)

Marcin, I already filed bug #23223 for the NV4B, I'll update that one. It's quite old, but I'll continue from there. I'll open a new one for the NV4E.

I'm waiting on filing a bug for the NV4E. It seems that nouveau works properly, I don't see any delays or errors at all in dmesg. It is whining about the b43 firmware, but even if I unload the driver the resume still takes 1m and 30 seconds. But it works. I'm not sure this is a nouveau bug, hence I'm not filing it. I will dig deeper when I have time.

perfect!
the last patch did the trick (tested with vanilla 3.7). I hibernated several times, even with a running instance of glxgears. No problem anymore. On my side the startup time is not especially long.

thanks

*** Bug 55258 has been marked as a duplicate of this bug. ***

Created attachment 72499 [details]
W520-4276CTO-NVC3 dmesg 3.8.0-rc1 suspend/resumce cycle screen garbage
Tested NVC3 3.7.1 openSUSE kernel and nouveau 3.8.0-rc1 as of 2013-01-03. Both show same screen distortion (snowy background, button background and window border ok) on gdm when resuming. Though the system is "usable" load wise and doesn't lock-up.
Am I still in the scope of this bug with the following endlessly repeating errors on resume? (Which will go away if killing X.)
[ 1623.833303] nouveau E[ PGRAPH][0000:01:00.0] TRAP ch 1 [0x007fe00000 Xorg[1009]]
[ 1623.833311] nouveau E[ PGRAPH][0000:01:00.0] SHADER 0xa004021e
(In reply to comment #30)
> @Raphaël Droz: What kernel version are you using? And maybe the kernel
> version from Mr-4 might be useful as well. Just to confirm.

Apologies for this late reply, but since I first started experiencing these problems, and with no end in sight, I just got a bit fed up with it and was forced to downgrade my kernel. I am sticking with 3.3.x as it seems very stable as far as the nouveau suspend/resume problems go: I am getting an occasional "blip" (resume failure) about once every 15-20 suspend/resume cycles, which to me is as good and as manageable as could be expected, given what I experienced with kernels 3.4, 3.5 and 3.6.

I am currently waiting for 3.7 to become mainstream for my distro (I am using Fedora) so that I could give it a go, but I don't hold my breath to be honest.

Just FYI: the problems I have experienced with kernels 3.4 - 3.6 have been the same as explained by a few others on here - every second suspend/resume cycle is a complete and utter failure, without exception. It's like clockwork.

I haven't applied the patch in Comment#37, but will do as soon as 3.7 becomes "mainstream" in my distro and will post the results on here.

(In reply to comment #45)
> I am currently waiting for 3.7 to become mainstream for my distro (I am
> using Fedora) so that I could give it a go, but I don't hold my breath to be
> honest.
>

It will be a while until Fedora ships 3.7+ kernels by default. If you want to help test vanilla kernels there is a well maintained repo to do so.

See https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories

michael.weirauch: this seems to be another bug

(In reply to comment #46)
> It will be a while until Fedora ships 3.7+ kernels by default. If you want
> to help test vanilla kernels there is a well maintained repo to do so.
>
> See https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories

Then I am better off adapting the patch in Comment#31 to work with 3.6.10, because I am not going to turn my development machine, which I use nearly 24/7, into a test harness.

(In reply to comment #38)
> This patch is not directly applicable to kernels < 3.7, although porting it
> is trivial.

"Porting" of that patch against 3.6 is far from "trivial" - exercise in futility more like! There is no "nv50_fence.c" to start off with, the "nouveau_drm" struct isn't there either, let alone that "nv10_fence_priv" doesn't seem to have "base.resume" defined at all.

I think I am going to stick with 3.3 until 3.7 comes out and then decide what to do next...

@Mr-4,
I assumed that most of the rework regarding power management went in at v3.5. But v3.7 had the huge overhaul everyone is talking about. So I assumed that the patch might work on v3.5 and v3.6 as well. However, both kernels are EOL'ed. Sorry for the confusion and the false hope...

Maybe you can send a patch to Fedora? Maybe they will happily backport it for you. Especially since they try to be bleeding edge and all...

(In reply to comment #50)
> Maybe you can send a patch to Fedora? Maybe they will happily backport it
> for you. Especially since they try to be bleeding edge and all...

I would have if I knew how to fix it and make it work, since I assumed the changes would be ... (ahem) quite "trivial". As it turns out, they aren't, and my knowledge of this driver's inner workings isn't enough for me to re-create the patch for 3.6, which, incidentally, is still the latest mainstream-supported kernel in my distro.

Ben Skeggs, who is responsible for the nvidia branch at Fedora (and who also appears listed in this bug), could do that - whether he chooses to is up to him, though given my past experiences with the nvidia team @Fedora, I am not holding my breath.
A patch referencing this bug report has been merged in Linux v3.8-rc4:

commit f20ebd034eab43fd38c58b11c5bb5fb125e5f7d7
Author: Marcin Slusarz <marcin.slusarz@gmail.com>
Date: Tue Dec 25 18:13:22 2012 +0100

    drm/nv17-50: restore fence buffer on resume

Created attachment 73421 [details] [review]
Backport for 3.5 (and may be 3.6)

For anyone who is still using or interested in kernel 3.5 or 3.6, can you try/verify the attached patch? I checked it and hope that it works; it ended up being a simpler diff.

Created attachment 73488 [details] [review]
Backport for 3.5 (and may be 3.6) - fixed

Sorry, the patch I posted had a small typo, this one is ok...

I will try to test this weekend, cannot promise anything though.

Spot on! It worked :)

Just found a small window to test:
- 3.5.7+patch(v2): success! 5 cycles, clean dmesg
- 3.6.9+patch(v2): success! 5 cycles, clean dmesg

Backporting this will make a lot of people happy. Thank you.

This backported patch will crash on nv10-nv17 - priv->bo is only created for chipsets >= nv17.

Created attachment 73597 [details] [review]
Backport for 3.5 (and may be 3.6) - fixed again

argh, thanks. This is the fixed version.

(In reply to comment #58)
> Created attachment 73597 [details] [review] [review]
> Backport for 3.5 (and may be 3.6) - fixed again

I have to say, I didn't expect this to be fixed, but this time it looks as though this nasty bug is being squashed. I have been doing hibernate/restore like mad (a couple of times a day) for the past week and not a single glitch, not one!
My one-and-only gripe with this (and it really is a minor one, given the circumstances) concerns fan control. I used a "fork" of the mainstream driver from Martin Peres (mupuf) at gitorious (http://gitorious.org/linux-nouveau-pm/linux-nouveau-pm/commits/thermal) to take advantage of the automatic fan management, so that I don't have to execute "echo 40 > /sys/class/drm/card0/device/pwm0" every time I come out of hibernation to stop my nvidia fan wailing like a jet engine. This is still not implemented in the mainstream driver code.

I used to have this so that when I come out of hibernate the automatic fan management takes over and reduces my fan speed according to the current temperature of the video card. It worked a treat, but I don't seem to be getting this any more, and applying the patch which fixed the above bug is not possible against Martin's fork, unfortunately.

(In reply to comment #59)
You can put the following line:

"echo 40 > /sys/class/drm/card0/device/pwm0"

in a file in

/usr/lib/pm-utils/sleep.d/

Check the other files there for the format/arguments. Make sure you have the permissions right (root:root 700).

(In reply to comment #60)
> You can put the following line:
>
> "echo 40 > /sys/class/drm/card0/device/pwm0"
>
> In a file in
>
> /usr/lib/pm-utils/sleep.d/

Thanks Ronald, I'll try that, though, ideally, I would have liked to get my automatic fan management back - it was absolutely flawless. I don't know why it was discarded from the mainstream tree.

Sorry for not replying (for a long time). I'm still having this issue on resume (gives RGB-stripes/freeze).

The following packages are installed from the Arch Linux repo:
nouveau-dri 9.0.2-1
xf86-video-nouveau 1.0.6-1
linux 3.7.8-1

I have not tried the GIT-version in AUR. The GPU is an EVGA nVidia GeForce 450 GTS (1GB RAM).

Did you apply the patch from comment #36?
https://bugs.freedesktop.org/show_bug.cgi?id=50121#c36
Only v3.8 carries the fix.

(In reply to comment #63)
> Only v3.8 carries the fix.

Nope!
My own distro (Fedora, as it turns out) had this patch already in their main nv tree as of kernel 3.7.4, so there was no need for me to apply it separately - I just built the kernel and have had zero problems with my nv card ever since.

(In reply to comment #44)
I am experiencing the same problem on my w520 with Nvidia GF106 [Quadro 2000M]. I get the same symptoms / messages. Did you find a fix for the problem?

I noticed that suspend / resume works if using libdrm-nouveau1a only (without libdrm-nouveau2), however the mouse gets lost on resume. If using libdrm-nouveau2, I get the problems. Maybe the problem is somehow related to dri2?

> Am I still in the scope of this bug with the following endlessly repeating
> errors on resume? (Which will go away if killing X.)
>
> [ 1623.833303] nouveau E[ PGRAPH][0000:01:00.0] TRAP ch 1 [0x007fe00000
> Xorg[1009]]
> [ 1623.833311] nouveau E[ PGRAPH][0000:01:00.0] SHADER 0xa004021e

(In reply to comment #65)
> (In reply to comment #44)
> I am experiencing the same problem on my w520 with Nvidia GF106 [Quadro
> 2000M].. I get the same symptoms / messages. Did you find a fix for the
> problem?
>
> I noticed that suspend / resume works if using libdrm-nouveau1a only
> (without libdrm-nouveau2), however I the mouse gets lost on resume. If using
> libdrm-nouveau2, I get the problems.. Maybe the problem is somehow related
> to dri2?
>
> > Am I still in the scope of this bug with the following endlessly repeating
> > errors on resume? (Which will go away if killing X.)
> >
> > [ 1623.833303] nouveau E[ PGRAPH][0000:01:00.0] TRAP ch 1 [0x007fe00000
> > Xorg[1009]]
> > [ 1623.833311] nouveau E[ PGRAPH][0000:01:00.0] SHADER 0xa004021e

Very interesting comments. Please attach this info again to bug 59168, where the W520 lovers hang around regarding this issue.

Here we go again...

After experiencing the joy and the meaning of "stability" and what it feels like, I decided to upgrade to kernel 3.7.9 (the latest stable) from 3.7.4 and voila... this bug reared its ugly head again after the first hibernate/restore cycle. This is what I get:

kernel: PM: Syncing filesystems ... done.
kernel: Freezing user space processes ... (elapsed 0.01 seconds) done.
kernel: PM: Preallocating image memory... done (allocated 220534 pages)
kernel: PM: Allocated 882136 kbytes in 0.25 seconds (3528.54 MB/s)
kernel: Freezing remaining freezable tasks ... (elapsed 3.29 seconds) done.
kernel: Suspending console(s) (use no_console_suspend to debug)
kernel: sd 2:0:0:0: [sda] Synchronizing SCSI cache
kernel: i8042 kbd 00:0a: wake-up capability enabled by ACPI
kernel: mpu401 00:05: disabled
kernel: nouveau [ DRM] suspending fbcon...
kernel: nouveau [ DRM] suspending display...
kernel: nouveau [ DRM] unpinning framebuffer(s)...
kernel: nouveau [ DRM] evicting buffers...
kernel: pciehp 0000:00:02.0:pcie04: pciehp_suspend ENTRY
kernel: agpgart-via 0000:00:00.0: Refused to change power state, currently in D0
kernel: pci 0000:00:13.1: wake-up capability enabled by ACPI
kernel: nouveau [ DRM] suspending client object trees...
kernel: PM: freeze of devices complete after 343.943 msecs
kernel: PM: late freeze of devices complete after 0.310 msecs
kernel: PM: noirq freeze of devices complete after 0.488 msecs
kernel: ACPI: Preparing to enter system sleep state S4
kernel: PM: Saving platform NVS memory
kernel: Disabling non-boot CPUs ...
kernel: Broke affinity for irq 1
kernel: Broke affinity for irq 16
kernel: smpboot: CPU 1 is now offline
kernel: PM: Creating hibernation image:
kernel: PM: Need to copy 169979 pages
kernel: PM: Restoring platform NVS memory
kernel: Enabling non-boot CPUs ...
kernel: smpboot: Booting Node 0 Processor 1 APIC 0x1
kernel: CPU1 is up
kernel: ACPI: Waking up from system sleep state S4
kernel: PM: noirq restore of devices complete after 22.507 msecs
kernel: PM: early restore of devices complete after 0.121 msecs
kernel: pciehp 0000:00:02.0:pcie04: pciehp_resume ENTRY
kernel: nouveau [ DRM] re-enabling device...
kernel: nouveau [ DRM] resuming client object trees...
kernel: nouveau [ VBIOS][0000:01:00.0] running init tables
kernel: pci 0000:00:13.1: wake-up capability disabled by ACPI
kernel: mpu401 00:05: activated
kernel: i8042 kbd 00:0a: wake-up capability disabled by ACPI
kernel: agpgart-via 0000:00:00.0: AGP 3.5 bridge
kernel: agpgart: kworker/u:1 tried to set rate=x12. Setting to AGP3 x8 mode.
kernel: agpgart-via 0000:00:00.0: putting AGP V3 device into 8x mode
kernel: nouveau 0000:01:00.0: putting AGP V3 device into 8x mode
kernel: nouveau [ DRM] resuming display...
kernel: nouveau [ PFIFO][0000:01:00.0] unknown intr 0x00010000, ch 0
[...ad infinitum...]
kernel: nouveau [ PFIFO][0000:01:00.0] still angry after 101 spins, halt
kernel: nouveau [ DRM] 0xD3FB: Parsing digital output script table
kernel: nouveau [ DRM] Setting dpms mode 3 on TV encoder (output 3)
kernel: nouveau [ DRM] 0xD3FB: Parsing digital output script table
abrt[3701]: not dumping repeating crash in '/usr/bin/Xorg'
kernel: nouveau E[ 3644] failed to idle channel 0xcccc0000
abrt[3716]: saved core dump of pid 3710 (/usr/bin/Xorg) to /var/spool/abrt/ccpp-1361815666-3710.new/coredump (3674112 bytes)
abrtd: Directory 'ccpp-1361815666-3710' creation detected
abrtd: Crash is in database already (dup of /var/spool/abrt/ccpp-1361815646-2078)
abrtd: Deleting crash ccpp-1361815666-3710 (dup of ccpp-1361815646-2078), sending dbus signal
kernel: nouveau E[ 3710] failed to idle channel 0xcccc0000
abrt[3726]: not dumping repeating crash in '/usr/bin/Xorg'
kernel: nouveau E[ 3720] failed to idle channel 0xcccc0000
[...the last two messages repeat ad-infinitum...]

So, I think that whatever was done in the nvidia kernel code after 3.7.4 (I haven't tested with any version prior to 3.7.9) brought that nasty bug back to life, and I think it should be reopened, seeing that I am not the only one experiencing this...
Is it consistent, git says 'no':

ronald@Alpha /usr/src/linux :) $ git log --oneline --no-merges v3.7.4...v3.7.9 -- drivers/gpu/drm
330b8ab drm/nouveau: add lockdep annotations
c7fc196 drm/radeon: Calling object_unrefer() when creating fb failure
b666341 drm/radeon: prevent crash in the ring space allocation
3e15c0b drm/radeon: protect against div by 0 in backend setup
f69c00e drm/radeon: fix backend map setup on 1 RB sumo boards
cb23301 drm/radeon: fix MC blackout on evergreen+
04589e1 drm/radeon: add quirk for RV100 board
4e370f5 drm/radeon: add WAIT_UNTIL to the non-VM safe regs list for cayman/TN
83f83a0 drm/radeon/evergreen+: wait for the MC to settle after MC blackout
ec8a7ca drm/i915: fix FORCEWAKE posting reads
710c8f5 drm/radeon: fix a rare case of double kfree
c140fec drm/radeon: fix error path in kpage allocation
72b56e8 efi: Make 'efi_enabled' a function to query EFI facilities
8a9d24b drm/i915: dump UTS_RELEASE into the error_state
36ce28c drm/i915: GFX_MODE Flush TLB Invalidate Mode must be '1' for scanline waits
924782c drm/i915: Disable AsyncFlip performance optimisations
18c8e49 radeon_display: Use pointer return error codes
45bced1 drm/radeon: fix cursor corruption on DCE6 and newer
dd1b4df drm/i915: Implement WaDisableHiZPlanesWhenMSAAEnabled
5afeb70 drm/i915: Invalidate the relocation presumed_offsets along the slow path

Don't think the lockdep annotations do anything that changes the code. But you might want to give it a try...

Whoops, typo in first sentence. What I meant was:

Is the crash consistent? Does it happen every time?
The fix should be applied at v3.7.3:

ronald@Alpha /usr/src/linux :) $ git log --oneline --no-merges v3.7^...v3.7.9 -- drivers/gpu/drm/nouveau Makefile
5b7be63 Linux 3.7.9
7773647 Linux 3.7.8
330b8ab drm/nouveau: add lockdep annotations
89c5f13 Linux 3.7.7
07c4ee0 Linux 3.7.6
13280f4 Linux 3.7.5
8380f1a arm64: makefile: fix uname munging when setting ARCH on native machine
8a69ca2 Linux 3.7.4
078314e Linux 3.7.3
39c5cda drm/nvc0/fb: fix crash when different mutex is used to protect same list
c6f94c3 drm/nouveau/clock: fix support for more than 2 monitors on nve0
7fbc316 drm/nouveau: add locking around instobj list operations
4d64387 drm/nouveau: fix blank LVDS screen regression on pre-nv50 cards
ad30b29 drm/nv17-50: restore fence buffer on resume
2370a21 drm/prime: drop reference on imported dma-buf come from gem
cfdfb8f drm/nouveau: fix init with agpgart-uninorth
a77af8b kbuild: Do not remove vmlinux when cleaning external module
e6577f3 Linux 3.7.2
cc86050 Linux 3.7.1
2959440 Linux 3.7

Yes, I'm showing off my newly attained git skillz =P

(In reply to comment #69)
> Is the crash consistent? Does it happen every time? The fix should be
> applied at v3.7.3:

No. I've downgraded the kernel slightly - from 3.7.9-201 to 3.7.9-101 (both official releases from Fedora) and after the first hibernate/restore cycle I get:

nouveau [ PFIFO][0000:01:00.0] DMA_PUSHER - Ch 0 Get 0x00000004 Put 0x000000c0 State 0x80000720 (err: INVALID_CMD) Push 0x00000000

However, the restore seems to work OK - I have no side effects after it is done, apart from the above message, so this might be an indication of what goes wrong - don't know, but thought to post it here.

You should test with a vanilla kernel to exclude distro tampering with the kernel. Furthermore, maybe test whether the above patch introduces this regression? Whatever you do, I suggest you open a new bug for this to get some more exposure.
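The two git log queries quoted above list what changed under a directory between two release tags. A small sketch of the same idea, assuming a kernel git checkout that has the stable tags; the function name is made up for illustration. It uses the two-dot range a..b (commits reachable from b but not from a); the three-dot form used above gives the same list whenever the first tag is an ancestor of the second, which is the case for consecutive releases on one stable branch:

```shell
#!/bin/sh
# Hypothetical helper: one-line summary of non-merge commits between two
# revisions, limited to a path.
# $1 = older revision, $2 = newer revision,
# $3 = path to restrict to (defaults to drivers/gpu/drm).
changes_between() {
    git log --oneline --no-merges "$1..$2" -- "${3:-drivers/gpu/drm}"
}

# Example:
#   changes_between v3.7.4 v3.7.9
#   changes_between v3.7.4 v3.7.9 drivers/gpu/drm/nouveau
```

A short list like the one above makes it easier to decide whether a suspected regression can come from the driver at all before starting a full bisection.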
Things have been pretty stable with 3.8.1, but it is getting a bit "twitchy" with 3.8.7 - on two separate occasions I've got the following error on restore:

kernel: nouveau [ DRM] re-enabling device...
kernel: nouveau [ DRM] resuming client object trees...
kernel: nouveau [ VBIOS][0000:01:00.0] running init tables
kernel: nouveau [ DRM] resuming display...
kernel: nouveau W[ PFIFO][0000:01:00.0] unknown intr 0x00010000, ch 0
kernel: nouveau W[ PFIFO][0000:01:00.0] unknown intr 0x00010000, ch 0
kernel: nouveau W[ PFIFO][0000:01:00.0] unknown intr 0x00010000, ch 0
[...]
kernel: nouveau E[ PFIFO][0000:01:00.0] still angry after 101 spins, halt
kernel: nouveau [ DRM] 0xD3FB: Parsing digital output script table
kernel: nouveau [ DRM] Setting dpms mode 3 on TV encoder (output 3)
kernel: nouveau [ DRM] 0xD3FB: Parsing digital output script table

at which point the whole screen starts with the now-familiar black-and-white rectangles and I have no option but to reboot. The above error does not happen very often - maybe every 8-10 hibernate/resume cycles, but with the 3.8.1 kernel I didn't have any errors at all.

Assuming it's nouveau specific and not drm:

2705de0 drm/nouveau: fix handling empty channel list in ioctl's

is between 3.8.6 and 3.8.7. However, there are some generic DRM changes as well:

faec22f KMS: fix EDID detailed timing frame rate
eaa1a61 KMS: fix EDID detailed timing vsync parsing

To be honest, my card is wonky too. Sometimes the screen goes black and it oopses (I think). But I think some of it is also old age...