Bug 50121

Summary: [Regression] Since kernel v3.5 several cards fail to resume, introduced by: 'convert to exec engine, and improve channel sync'
Product: xorg
Reporter: Ronald <ronald645>
Component: Driver/nouveau
Assignee: Nouveau Project <nouveau>
Status: RESOLVED FIXED
QA Contact: Xorg Project Team <xorg-team>
Severity: normal
Priority: medium
CC: florian, francois5537, gibboris, mr.dash.four
Version: git
Keywords: regression
Hardware: x86 (IA32)
OS: Linux (All)
Attachments:
Description | Flags
Dmesg log of bad resume | none
Dmesg output of correct resume | none
Dmesg boot log of the 'bad' kernel | none
Dmesg resume log of the 'bad' kernel | none
Dmesg boot log of the 'good' kernel | none
Dmesg resume log of the 'good' kernel | none
W520-4276CTO-NVC3 dmesg commitish-872dcac gdm + suspend/resume cycle | none
Dmesg of boot | none
Dmesg of succesful first resume | none
Dmesg of failed second resume | none
v3.7-rc4 + nouveau patchlist | none
Diff of both resume sessions | none
Reset agp.stat to unknown during suspend | none
Dmesg of v3.8rc1 + nouveau HEAD | none
fix | none
fix v2 | none
W520-4276CTO-NVC3 dmesg 3.8.0-rc1 suspend/resumce cycle screen garbage | none
Backport for 3.5 (and may be 3.6) | none
Backport for 3.5 (and may be 3.6) - fixed | none
Backport for 3.5 (and may be 3.6) - fixed again | none

Description Ronald 2012-05-19 09:33:58 UTC
Created attachment 61851 [details]
Dmesg log of bad resume

The following commit:

Author: Ben Skeggs <bskeggs@redhat.com>
Date:   Mon Apr 30 13:55:29 2012 +1000

    drm/nouveau/fence: convert to exec engine, and improve channel sync
    
    Now have a somewhat simpler semaphore sync implementation for nv17:nv84,
    and a switched to using semaphores as fences on nv84+ and making use of
    the hardware's >= acquire operation.

This commit probably causes the FX5200 to fail to resume. I'm not 100% sure, since reverting it on top of HEAD does not apply without conflicts because of the recent cleanups and reworks. I cannot proceed without help. If someone could generate a revert of this commit on top of current HEAD, I will then be able to confirm this is the exact bad commit.

Further down below you will only find computer generated output (in this order):

- The git bisect log
- The output of lspci for the nvidia card
- A list of currently used userspace packages (with version) from https://launchpad.net/~xorg-edgers/+archive/ppa:
- Attached: The file nouveau.bad.resume.txt shows dmesg output of a bad resume (commit 5d720f245).
- Attached: The file nouveau.good.resume.txt shows dmesg output of a good resume (commit 1d226cc142b).

gebruiker@Delta:~/Documenten/Ronald/linux-git$ git bisect log
git bisect start
# bad: [0e29f737548c749482371ba307a6de15ae2c1956] drm/nouveau: make engine subclass subdev, and noaccel a bitfield
git bisect bad 0e29f737548c749482371ba307a6de15ae2c1956
# good: [9eb608d0091c11e5712b421c8d3c7cec8950d14e] drm/nv04/disp: disable vblank interrupts when disabling display
git bisect good 9eb608d0091c11e5712b421c8d3c7cec8950d14e
# bad: [da9472c2db711fb589e248566ea17288d3c993e5] drm/nv04/software: fix engine creation
git bisect bad da9472c2db711fb589e248566ea17288d3c993e5
# bad: [f903665be55c7d347bcfd684745026af30439a8d] drm/nv50: remove execution engine context saves on suspend
git bisect bad f903665be55c7d347bcfd684745026af30439a8d
# bad: [da495ac412f6a70185305facd756f0f04fb5fd3b] drm/nouveau: fix engine context destructor ordering
git bisect bad da495ac412f6a70185305facd756f0f04fb5fd3b
# good: [11d9712f6d91203bd3f34ef2cebf1fd188e73756] drm/nouveau: move flip-related channel setup to software engine
git bisect good 11d9712f6d91203bd3f34ef2cebf1fd188e73756
# good: [1d226cc142b4e504150b9d5455545720fbde6f1f] drm/nouveau/fence: minor api changes for an upcoming rework
git bisect good 1d226cc142b4e504150b9d5455545720fbde6f1f
# bad: [5d720f24505c3fb6b4740fbf5b6e99839de2fbd9] drm/nouveau/fence: convert to exec engine, and improve channel sync
git bisect bad 5d720f24505c3fb6b4740fbf5b6e99839de2fbd9

root@Delta:/var/log# lspci -s 01:00.0 -vvvv -nnnn
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation NV34 [GeForce FX 5200] [10de:0322] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: ASUSTeK Computer Inc. Device [1043:80df]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32 (1250ns min, 250ns max)
	Interrupt: pin A routed to IRQ 16
	Region 0: Memory at de000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at d0000000 (32-bit, prefetchable) [size=128M]
	Expansion ROM at dfee0000 [disabled] [size=128K]
	Capabilities: [60] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [44] AGP version 3.0
		Status: RQ=32 Iso- ArqSz=0 Cal=3 SBA+ ITACoh- GART64- HTrans- 64bit- FW+ AGP3+ Rate=x4,x8
		Command: RQ=32 ArqSz=0 Cal=0 SBA+ AGP+ GART64- 64bit- FW+ Rate=x8
	Kernel driver in use: nouveau

Xorg-edgers packages used relevant for this bug:
- libdrm
2.4.34+git20120512.e07b6506-0ubuntu0ricotz~precise
- libpciaccess
0.13-1~precise2
- mesa
8.1~git20120512.f9654084-0ubuntu0ricotz~precise
- xorg-server
2:1.12.1.901+git20120510+server-1.12-branch.58dfb139-0ubuntu0ricotz~precise
- xserver-xorg-video-nouveau
1:0.0.16+git20120509.58156446-0ubuntu0sarvatt~precise
Comment 1 Ronald 2012-05-19 09:34:35 UTC
Created attachment 61852 [details]
Dmesg output of correct resume
Comment 2 Ronald 2012-05-19 09:40:47 UTC
On a small sidenote, compiling current head with

!CONFIG_DRM_NOUVEAU_BACKLIGHT

results in:

http://pastebin.com/8TvZNDqz

Current head means:

1. git pull git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
2. git pull git://anongit.freedesktop.org/nouveau/linux-2.6
Comment 3 Ronald 2012-05-19 12:27:58 UTC
Through googling I found out that drm has its own debug facility (duh...). I have added 4 files that are generated with the following kernel boot parameters:

drm.debug=15 log_buf_len=32M
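
For readers wondering what the value 15 means: drm.debug is a bitmask, so 15 simply turns on the four lowest debug categories at once. The sketch below only splits a value into its set bits. The category names in the comments (core, driver, KMS, PRIME) come from later kernel documentation and are an assumption for the 2012 kernels discussed here; the function name is made up for illustration.

```shell
#!/bin/sh
# Split a drm.debug value into the individual bits that are set.
# Category names are an assumption based on later kernel docs:
#   0x01 core, 0x02 driver, 0x04 KMS, 0x08 PRIME
decode_drm_debug() {
    mask=$1
    bit=0
    while [ "$mask" -gt 0 ]; do
        # Print the bit value whenever the lowest remaining bit is set.
        if [ $((mask & 1)) -eq 1 ]; then
            printf '0x%02x\n' $((1 << bit))
        fi
        mask=$((mask >> 1))
        bit=$((bit + 1))
    done
}

decode_drm_debug 15
```

A smaller mask (for example 6 for only the driver and KMS bits, under the same naming assumption) would probably produce a shorter log.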

The 'bad' state contains the bootlog and suspend log of the kernel at commit 5d720f2450 from the nouveau/linux-2.6 tree.

The 'good' state contains the bootlog and suspend log of the kernel at the commit before 5d720f2450 from the nouveau/linux-2.6 tree. (Which is the default kernel for now.)

I'm out of ideas right now, so any pointers would be great.
Comment 4 Ronald 2012-05-19 12:28:40 UTC
Created attachment 61858 [details]
Dmesg boot log of the 'bad' kernel
Comment 5 Ronald 2012-05-19 12:28:57 UTC
Created attachment 61859 [details]
Dmesg resume log of the 'bad' kernel
Comment 6 Ronald 2012-05-19 12:29:19 UTC
Created attachment 61860 [details]
Dmesg boot log of the 'good' kernel
Comment 7 Ronald 2012-05-19 12:29:45 UTC
Created attachment 61861 [details]
Dmesg resume log of the 'good' kernel
Comment 8 Ronald 2012-06-08 04:31:58 UTC
I tried kernel 3.5-rc1, because it seems that some recent patches were left out during the merge. I was hoping that this newer kernel would somehow solve this. However, I'm still having the same issues in this release.

On a further note, the commit before 5d720f2450 was a pretty good release for this card. Glxgears reported almost 400fps and window resizing was smooth. With v3.4.1 or v3.5-rc1, this performance regressed to its original state. Glxgears reports 100fps and resizing is choppy again.
Comment 9 michael.weirauch 2012-08-22 06:53:37 UTC
Now that I have found out the root cause of my resume trouble (documented in 53101) I found this bug.

Here's the info so far: (replicating from bug 53101)

ThinkPad W520 4276CTO NVC3 (2000M)
openSUSE 12.2 + nouveau 20120813 872dcac

* Booting works (nox2apic, W520 ACPI table issue)
* gdm has graphics distortions though (see early dmesg excerpt)
* double ctrl+alt+backspace "fixes" this and gdm looks good
* suspend from gnome-shell 3.4.2 works
* resume shows gdm-password prompt and usually a white-noise background
** the gnome-shellish top-panel looks intact, though
** mouse cursor not movable, cpu load
** looks like "something" tries to restart gdm/X over and over again
* switching to vt possible with some insisting
* restarting gdm does lock up the system
* the "channel x kick timeout" seems new since some commits IIRC

repeatedly in dmesg:
[  156.925301] nouveau E[   PFIFO][0000:01:00.0] playlist update failed
[  159.924800] nouveau E[     DRM][0000:01:00.0] failed to idle channel 0xcccc0000
[  161.924690] nouveau E[   PFIFO][0000:01:00.0] channel 1 kick timeout
[  161.924787] nouveau  [   PFIFO][0000:01:00.0] unknown status 0x00000100
[  163.924603] nouveau E[   PFIFO][0000:01:00.0] playlist update failed
[  163.989722] nouveau  [   PFIFO][0000:01:00.0] unknown status 0x00000100
[  165.989535] nouveau E[   PFIFO][0000:01:00.0] channel 3 kick timeout
[  165.989670] nouveau  [   PFIFO][0000:01:00.0] unknown status 0x00000100
[  167.989455] nouveau E[   PFIFO][0000:01:00.0] playlist update failed
[  167.989517] nouveau ![   PFIFO][0000:01:00.0] unhandled status 0x00000001
[  170.649537] nouveau E[   PFIFO][0000:01:00.0] playlist update failed
[  172.660200] nouveau E[   PFIFO][0000:01:00.0] playlist update failed
[  185.103713] nouveau E[     DRM][0000:01:00.0] failed to idle channel 0xcccc0001
[  187.103627] nouveau E[   PFIFO][0000:01:00.0] channel 2 kick timeout

I tried a fc17 install and the original kernel (3.3.4-5.fc17.x86_64) worked. Suspend/resume is fine, at least when not in the docking station. After updating that test install to 3.5.1-1.fc17.x86_64, the same issues I see in openSUSE 12.2 cropped up. So this looks distribution agnostic.

--
Bisection rounds testing successful suspend/resume cycles on NVC3/2000M:

note:
* gdm greeter is showing garbage (screen content from before reboot) somewhere before the last known good commits
** this issue was ignored and still present in the last good commit but is not the topic of this bug

$ git bisect log
# bad: [f9b495fca46836a6a05cedde8058ccb8a3e62c3d] drm/nouveau: use ioread32_native/iowrite32_native for fifo control registers
# good: [f887c425f9eeed8ffbca64c8be45da62b07096c0] drm/nouveau: bump version to 1.0.0
git bisect start 'HEAD' 'f887c425f9eeed8ffbca64c8be45da62b07096c0' '--' 'drivers/gpu/drm/nouveau/'
# bad: [9bd0c15fcfb42f6245447c53347d65ad9e72080b] drm/nouveau/fbcon: using nv_two_heads is not a good idea
git bisect bad 9bd0c15fcfb42f6245447c53347d65ad9e72080b
# good: [5132f37700210740117f5163b5df7aa1c8469a55] drm/nve0/fifo: initial implementation
git bisect good 5132f37700210740117f5163b5df7aa1c8469a55
# bad: [71af5e62db5d7d6348e838d0f79533653e2f8cfe] drm/nv50/gr: make sure NEXT_TO_CURRENT is executed even if nothing done
git bisect bad 71af5e62db5d7d6348e838d0f79533653e2f8cfe
# good: [afada5e0bb3cac8530c2ae36aa0abca41d60e063] drm/nv04/disp: disable vblank interrupts when disabling display
git bisect good afada5e0bb3cac8530c2ae36aa0abca41d60e063
# bad: [5e120f6e4b3f35b741c5445dfc755f50128c3c44] drm/nouveau/fence: convert to exec engine, and improve channel sync
git bisect bad 5e120f6e4b3f35b741c5445dfc755f50128c3c44
# good: [35bcf5d55540e47091a67e5962f12b88d51d7131] drm/nouveau: move flip-related channel setup to software engine
git bisect good 35bcf5d55540e47091a67e5962f12b88d51d7131
# good: [d375e7d56dffa564a6c337d2ed3217fb94826100] drm/nouveau/fence: minor api changes for an upcoming rework
git bisect good d375e7d56dffa564a6c337d2ed3217fb94826100


5e120f6e4b3f35b741c5445dfc755f50128c3c44 is the first bad commit
commit 5e120f6e4b3f35b741c5445dfc755f50128c3c44
Author: Ben Skeggs <bskeggs@redhat.com>
Date:   Mon Apr 30 13:55:29 2012 +1000

    drm/nouveau/fence: convert to exec engine, and improve channel sync

    Now have a somewhat simpler semaphore sync implementation for nv17:nv84,
    and a switched to using semaphores as fences on nv84+ and making use of
    the hardware's >= acquire operation.

    Signed-off-by: Ben Skeggs <bskeggs@redhat.com>

:040000 040000 8f2ca4ddf4969c75f688a96fdb152e449fda4852 da67a1bd8d608577e659a26715cf8af3644d8efe M    drivers

--

@Ronald
* can it be the ticket subject carries the wrong commitish? (because the bad commit we both identified is "5e120f6e4b3f35b741c5445dfc755f50128c3c44" in the nouveau/linux-2.6 tree)
* if you like, extend the subject with NVC3/Quadro 2000M
Comment 10 michael.weirauch 2012-08-22 06:54:32 UTC
Created attachment 65931 [details]
W520-4276CTO-NVC3 dmesg commitish-872dcac gdm + suspend/resume cycle
Comment 11 Ronald 2012-08-22 17:33:50 UTC
Thanks for your info and a 'me2'. I updated the title as you requested. I placed the entire commit title.

Every time a new kernel is released, the nouveau project does a 'git rebase', which essentially removes all patches outside of Linus' tree and reapplies them one by one. This changes the SHA hashes, since git treats the reapplied patches (although essentially the same) as new (and thus different) commits.
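
Because the hash differs between a rebased tree and an older checkout, the commit subject is the more stable handle. A minimal sketch of looking a commit up by its subject line, assuming a local clone of whichever tree is being tested (the function name and the path in the usage comment are made up for illustration):

```shell
#!/bin/sh
# Look up commits by (part of) their subject instead of by hash,
# since a rebase gives the same patch a new hash.
#   $1 = path to a git work tree
#   $2 = text to search for in the commit message
find_commit_by_subject() {
    git -C "$1" log --format='%h %s' --fixed-strings --grep="$2"
}

# Hypothetical usage against a local clone:
# find_commit_by_subject ~/linux-git 'convert to exec engine, and improve channel sync'
```

The abbreviated hash printed this way is only valid for the tree it was found in; another checkout of the same patch may report a different one.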
Comment 12 Ronald 2012-08-22 17:48:32 UTC
On a sidenote (just to summarize/confirm this bug), it seems that:
- symptoms are the exact same
-  resume fails
-  cursor invisible
-  screen is corrupted
-  no visual response
-  vt switching works with some persistence
-  logs are flooded (with generic messages)
- bisected patch is the exact same (as in title)
-  however, as of now, based on different kernel versions (v3.4 vs v3.5)

However, there is one nitpick. I have pinned this computer's kernel on 3.4 since 3.5-rc0 and higher suffer from another regression which I have failed to bisect (I forgot why/how that did not work).

When I have time this weekend, I will see how far I can get with Linus' tree and then combine with the nouveau tree and we will see what happens. A refresh of testdata might be good since the patches from the nouveau tree allow setting debug levels which is nice.
Comment 13 michael.weirauch 2012-08-23 05:46:04 UTC
Ah, thanks for clearing that up with the rebased commits. Didn't know they'd get a new hash then.

Apart from providing more verbose logs, I am unsure what else one can do here. I think a dev (Ben?) should get us into some direction. The bisected bad commit is a tad bit too huge to understand/revert/trial-and-error-tinker for me ;)
Comment 14 francoism 2012-10-10 12:22:00 UTC
*** Bug 55744 has been marked as a duplicate of this bug. ***
Comment 15 francoism 2012-10-10 12:23:24 UTC
Same issue here with my EVGA GeForce 450 GTS (1GB).

It works fine with the nVidia drivers.

Kernel: 3.6.1
nouveau-git & libdrm-git
OS: Archlinux
Comment 16 Ronald 2012-10-19 21:46:24 UTC
Francois,

Can you confirm that your regression is also caused by the following commit?

drm/nouveau/fence: convert to exec engine, and improve channel sync

Michael and my problems are not there when we build a kernel before this patch is applied.

Please test to make sure that you are not having a separate problem. Thank you.
Comment 17 Ronald 2012-11-05 11:42:07 UTC
Still not working with 3.7-rc4:


nouveau  [     DRM] re-enabling device...
nouveau  [     DRM] resuming client object trees...
nouveau  [   VBIOS][0000:01:00.0] running init tables
nouveau W[  PTIMER][0000:01:00.0] unknown input clock freq
agpgart-via 0000:00:00.0: AGP 3.5 bridge
agpgart: kworker/u:30 tried to set rate=x12. Setting to AGP3 x8 mode.
agpgart-via 0000:00:00.0: putting AGP V3 device into 8x mode
nouveau 0000:01:00.0: putting AGP V3 device into 8x mode
nouveau  [     DRM] Loading NV17 power sequencing microcode
nouveau  [     DRM] resuming display...
nouveau  [     DRM] Setting dpms mode 3 on TV encoder (output 1)
Restarting tasks ... done.
nouveau E[     DRM] reloc wait_idle failed: -16
nouveau E[     DRM] reloc apply: -16
nouveau E[     DRM] reloc wait_idle failed: -16
nouveau E[     DRM] reloc apply: -16
nouveau E[     DRM] reloc wait_idle failed: -16
nouveau E[     DRM] reloc apply: -16

Screen stays black on resume. I tested this +nouveau tree which enables z-compression.
Comment 18 Ronald 2012-11-06 10:10:40 UTC
Linux kernel 3.7-rc4 + nouveau tree (see nouveau.txt for patches):

CONFIG_LOG_BUF_SHIFT=21
CONFIG_NOUVEAU_DEBUG=6
CONFIG_NOUVEAU_DEBUG_DEFAULT=6 # 7 gave a >16MB log

Did a 'dmesg -c' for each separate logfile just before doing 'pm-hibernate'. Logfiles are from the same boot. First resume succeeds, second hangs just like the first occurrence of this bug.

Attaching files...
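
For reference, CONFIG_LOG_BUF_SHIFT is a power-of-two exponent for the kernel log buffer size, so the value in bytes can be worked out as below. This is just the arithmetic; the function name is made up for illustration.

```shell
#!/bin/sh
# CONFIG_LOG_BUF_SHIFT is an exponent: buffer size = 2^shift bytes.
log_buf_bytes() {
    echo $((1 << $1))
}

# Shift of 21 as used in the config above: 2097152 bytes, i.e. 2 MiB.
log_buf_bytes 21
```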
Comment 19 Ronald 2012-11-06 10:11:45 UTC
Created attachment 69606 [details]
Dmesg of boot
Comment 20 Ronald 2012-11-06 10:12:24 UTC
Created attachment 69607 [details]
Dmesg of succesful first resume
Comment 21 Ronald 2012-11-06 10:12:48 UTC
Created attachment 69608 [details]
Dmesg of failed second resume
Comment 22 Ronald 2012-11-06 10:13:25 UTC
Created attachment 69609 [details]
v3.7-rc4 + nouveau patchlist
Comment 23 Ronald 2012-11-06 10:21:13 UTC
I'm also attaching a diff of both logs. I filtered them using sed to get rid of the timestamps.

One thing stands out (to me):

The logs of the correct resume mention this:

-ACPI: Invalid Power Resource to register!
-agpgart-via 0000:00:00.0: Refused to change power state, currently in D0

The rest of the diff contains errors from the bad resume and subsequent trace thereof.
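
The timestamp filtering mentioned above could look roughly like the sketch below. It assumes the logs carry the usual "[  123.456789] " dmesg prefix (logs taken from syslog have a different prefix and would need another pattern); the function names and the file names in the usage comment are made up for illustration.

```shell
#!/bin/sh
# Strip the leading "[  123.456789] " kernel timestamp from each line,
# so that two logs from different moments can be compared.
strip_timestamps() {
    sed -E 's/^\[ *[0-9]+\.[0-9]+\] ?//' "$@"
}

# Diff two logs after removing their timestamps.
#   $1 = log of the good resume, $2 = log of the bad resume
diff_resume_logs() {
    good=$(mktemp)
    bad=$(mktemp)
    strip_timestamps "$1" > "$good"
    strip_timestamps "$2" > "$bad"
    diff -u "$good" "$bad"
    status=$?
    rm -f "$good" "$bad"
    # diff returns 0 when the filtered logs are identical, 1 when they differ.
    return "$status"
}

# Hypothetical usage:
# diff_resume_logs resume-good.txt resume-bad.txt > resume.diff
```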
Comment 24 Ronald 2012-11-06 10:21:40 UTC
Created attachment 69612 [details]
Diff of both resume sessions
Comment 25 Emil Velikov 2012-11-06 12:41:10 UTC
Created attachment 69616 [details] [review]
Reset agp.stat to unknown during suspend

Something to test :P
Comment 26 Ronald 2012-11-06 13:07:09 UTC
No dice =) Even hangs at suspend sometimes. Besides, this bug is not slot specific. Someone else with a 'NVC3/Quadro 2000M' (PCI-E) also has these problems since the mentioned commit:

'drm/nouveau/fence: convert to exec engine, and improve channel sync'.
Comment 27 Ronald 2012-12-24 08:36:25 UTC
Just tested v3.8-rc1 + nouveau @ 'drm/nouveau/fan: handle the cases where we are outside of the linear zone' on this FX5200.

Especially 'drm/nouveau/bios: cache ramcfg strap on later chipsets' looked promising (even though I don't have a recent chipset... there was hope...).

Things did not regress further. First suspend works most of the time, second suspend crashes. Symptoms did change a bit. When resume eventually completes, I do not see a frozen/unresponsive desktop but the screen goes black with red/green (I don't think I saw any blue) pixels scattered all over it. Density was really low, so maybe 200 pixels on the entire screen? These pixels did not move or alter or anything, it was pretty static and boring actually.

But I recommend Michael and francoism to test latest HEAD as the mentioned commit might fix your issues???

I will add a dmesg from v3.8rc1 kernel for your viewing convenience. This is a dmesg from syslog, not from the 'dmesg' command!
Comment 28 Ronald 2012-12-24 08:37:11 UTC
Created attachment 72060 [details]
Dmesg of v3.8rc1 + nouveau HEAD
Comment 29 Raphaël Droz 2012-12-24 13:23:06 UTC
referencing bug #55258 here
Comment 30 Ronald 2012-12-24 15:03:53 UTC
Kewl, that means that this regression is spread across several cards if you purely look at the symptoms.

- NV40 generation card (0x049200a2)
- twice on an NV34 [GeForce FX 5200] (IRC: FX5200 is an 'odd beast' :/)
- NVC3 (2000M)
- EVGA GeForce 450 GTS

From two of these, the regression has been confirmed to be resulting from commit:

'drm/nouveau/fence: convert to exec engine, and improve channel sync'

as part of the nouveau rework that got upstreamed in kernel 3.5:

- NV34 [GeForce FX 5200] by me
- NVC3 (2000M) by Michael Weirauch

I think it's safe to assume that ppl who are experiencing this regression as well using a kernel on 3.5 or above are suffering from the same issue.

Francoism is using 3.6, so that is okay.

@Raphaël Droz: What kernel version are you using? And maybe the kernel version from Mr-4 might be useful as well. Just to confirm.
Comment 31 Marcin Slusarz 2012-12-25 00:31:56 UTC
Created attachment 72085 [details] [review]
fix

does this patch fix it?

nvc3 should be already fixed in 3.7
Comment 32 Ronald 2012-12-25 09:25:47 UTC
Seems to have fixed the bug. It has survived three consecutive suspend and resume cycles. This was not possible after v3.5 without your supplied fix.

So it seems that the direct symptoms have been resolved with this patch. But if you don't mind, I would like to test it somewhat further for these days.

And I suggest that everyone else with this hardware (Raphael and Mr-4) test as well.

Guess I won't be telling my parents they are running an rc1 kernel hehe...
Comment 33 Ronald 2012-12-25 11:17:33 UTC
The FX5200 still functions and survived another cycle.

I went ahead and tested your patch on a 'G73 (NV4B)' as well. Since modeset became mandatory, suspend/resume never worked (anymore). It suspended, but resumed with a garbled console where some colors are off. Eventually, the card locks up.

Recent rework made the screen completely black. So the regression regressed even further. Current head with your patch restores the second regression back to where the display is garbled after resume (after which it locks up again).

So I'm fairly confident that your patch restores pre rework (v3.5) behaviour on these cards with respect to suspend and resume cycles.

I'm going to test an NV4E laptop later, when I have time.
Comment 34 Ronald 2012-12-25 12:24:08 UTC
Just tested v3.8rc1+nouveau+patch on an NV4E. Suspend and resume works! Though it takes about 1 minute before it shuts down (hangs somewhere). And maybe 1 or 2 minutes to resume (before screen is restored).

Restarting is faster, but this looks promising. At least it works.

Anyways, no regressions, just improvements.

Thanks Santa!
Comment 35 Marcin Slusarz 2012-12-25 16:35:08 UTC
Good. Please open new bug reports for the other issues you have with these (4B & 4E) boxes.
Comment 36 Marcin Slusarz 2012-12-25 17:58:25 UTC
Created attachment 72113 [details] [review]
fix v2

I simplified the fix a bit. Can you test it too? (Just one box would be enough)

(Note that you don't need to run it on 3.8-rc1 - it's applicable to 3.7 too)
Comment 37 Ronald 2012-12-25 20:07:18 UTC
Reverted the old patch and applied the new one. Did this on the same v3.8rc1+nouveau kernel tree. It only rebuilt the nouveau specific files.

The FX5200 survived 4 consecutive cycles with this patch. It seems to exhibit the same behaviour as the previous patch: During suspend and resume, for a short moment, there is a blue/green stray pixel in the top left corner. Everything works fine though.

Good to go if you ask me. Maybe apply it to 3.5 and 3.6 as well??

As for the other cards I mentioned (NV4E + NV4B) I merely referenced them to test the patch against regressions. Suspend is broken on NV4B since KMS became mandatory. And it seems to work somewhat on NV4E.

*My primary aim is to guard against regressions.* FWIW, do you really want separate bugs for the NV4E and NV4B??? They are just bugs, not regressions.
Comment 38 Marcin Slusarz 2012-12-25 20:28:50 UTC
This patch is not directly applicable to kernels < 3.7, although porting it is trivial.

WRT other bugs: Yes, I would like to see separate bug reports. Suspend/resume should work for all cards and not take too much time. I don't promise fixing them, of course.
Comment 39 Ronald 2012-12-26 11:31:57 UTC
Ok, I will file separate bugs for these cards.

FYI, the FX5200 survived 11 cycles so far without a hitch. Thanks a lot =)
Comment 40 Ronald 2012-12-26 11:38:09 UTC
Marcin, I already filed bug #23223 for the NV4B, I'll update that one. It's quite old, but I'll continue from there. I'll open a new one for the NV4E.
Comment 41 Ronald 2012-12-26 12:28:47 UTC
I'm waiting on filing a bug for the NV4E. It seems that nouveau works properly, I don't see any delays or errors at all in dmesg. It is whining about the b43 firmware, but even if I unload the driver the resume still takes 1m and 30 seconds. But it works.

I'm not sure this is a nouveau bug, hence I'm not filing it. I will dig deeper when I have time.
Comment 42 Raphaël Droz 2012-12-26 17:11:40 UTC
perfect!
the last patch did the trick (tested with vanilla 3.7).

I hibernated several times, even with a running instance of glxgears.
No problem anymore.

On my side the startup time is not especially long.

thanks
Comment 43 Marcin Slusarz 2012-12-26 19:39:58 UTC
*** Bug 55258 has been marked as a duplicate of this bug. ***
Comment 44 michael.weirauch 2013-01-04 08:57:05 UTC
Created attachment 72499 [details]
W520-4276CTO-NVC3 dmesg 3.8.0-rc1 suspend/resumce cycle screen garbage

Tested NVC3 3.7.1 openSUSE kernel and nouveau 3.8.0-rc1 as of 2013-01-03. Both show same screen distortion (snowy background, button background and window border ok) on gdm when resuming. Though the system is "usable" load wise and doesn't lock-up.

Am I still in the scope of this bug with the following endlessly repeating errors on resume? (Which will go away if killing X.)

[ 1623.833303] nouveau E[  PGRAPH][0000:01:00.0] TRAP ch 1 [0x007fe00000 Xorg[1009]]
[ 1623.833311] nouveau E[  PGRAPH][0000:01:00.0] SHADER 0xa004021e
Comment 45 Mr-4 2013-01-04 10:17:46 UTC
(In reply to comment #30)
> @Raphaël Droz: What kernel version are you using? And maybe the kernel
> version from Mr-4 might be useful as well. Just to confirm.
Apologies for this late reply, but since I first started experiencing these problems, and with no end in sight, I just got a bit fed up with it and was forced to downgrade my kernel - I am sticking with 3.3.x as it seems very stable as far as the nouveau suspend/resume problems go: I am getting an occasional "blip" (resume failure) about once every 15-20 suspend/resume cycles, which to me is as good and as manageable as it could be expected, given what I experienced with kernels 3.4, 3.5 and 3.6.

I am currently waiting for 3.7 to become mainstream for my distro (I am using Fedora) so that I could give it a go, but I don't hold my breath to be honest.

Just FYI: the problems I have experienced with kernels 3.4 - 3.6 have been the same as explained by a few others on here - every second suspend/resume cycle is a complete and utter failure, without exception. It's like clockwork.

I haven't applied the patch in Comment#37, but will do as soon as 3.7 becomes "mainstream" in my distro and will post the results on here.
Comment 46 Lucas Stach 2013-01-04 11:23:07 UTC
(In reply to comment #45)
> I am currently waiting for 3.7 to become mainstream for my distro (I am
> using Fedora) so that I could give it a go, but I don't hold my breath to be
> honest.
> 
It will be a while until Fedora ships 3.7+ kernels by default. If you want to help test vanilla kernels there is a well maintained repo to do so.

See https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories
Comment 47 Marcin Slusarz 2013-01-04 14:04:10 UTC
michael.weirauch: this seems to be another bug
Comment 48 Mr-4 2013-01-04 23:54:56 UTC
(In reply to comment #46)
> It will be a while until Fedora ships 3.7+ kernels by default. If you want
> to help test vanilla kernels there is a well maintained repo to do so.
> 
> See https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories
Then I am better off adapting the patch in Comment#31 to work with 3.6.10 because I am not going to turn my development machine, which I use nearly 24/7, into a test harness.
Comment 49 Mr-4 2013-01-05 07:20:55 UTC
(In reply to comment #38)
> This patch is not directly applicable to kernels < 3.7, although porting it
> is trivial.
"Porting" of that patch against 3.6 is far from "trivial" - exercise in futility more like!

There is no "nv50_fence.c" to start off with, "nouveau_drm" struct isn't there either, let alone that "nv10_fence_priv" doesn't seem to have "base.resume" defined at all.

I think I am going to stick with 3.3 until 3.7 comes out and then decide what to do next...
Comment 50 Ronald 2013-01-05 12:57:05 UTC
@Mr-4, I assumed that most of the rework went in at v3.5 regarding power management. But v3.7 had the huge overhaul everyone is talking about. So therefore I assumed that the patch might work at v3.5 and v3.6 as well. However, both kernels are EOL'ed. Sorry for the confusion and the false hope...

Maybe you can send a patch to Fedora? Maybe they will happily backport it for you. Especially since they try to be bleeding edge and all...
Comment 51 Mr-4 2013-01-05 14:56:19 UTC
(In reply to comment #50)
> Maybe you can send a patch to Fedora? Maybe they will happily backport it
> for you. Especially since they try to be bleeding edge and all...
I would have if I knew how to fix it and make it work, since I assumed the changes will be ... (ahem) quite "trivial".

As it turns out, they aren't and my knowledge of this driver's inner workings isn't enough for me to re-create the patch for 3.6, which, incidentally, is still the latest mainstream-supported kernel by my distro. 

Ben Skeggs, who is responsible for the nvidia branch at Fedora (and who also appears listed in this bug) could do that - whether he chooses to is up to him, though given my past experiences with nvidia team @Fedora, I am not holding my breath.
Comment 52 Florian Mickler 2013-01-19 23:03:43 UTC
A patch referencing this bug report has been merged in Linux v3.8-rc4:

commit f20ebd034eab43fd38c58b11c5bb5fb125e5f7d7
Author: Marcin Slusarz <marcin.slusarz@gmail.com>
Date:   Tue Dec 25 18:13:22 2012 +0100

    drm/nv17-50: restore fence buffer on resume
Comment 53 Herton Krzesinski 2013-01-22 04:14:57 UTC
Created attachment 73421 [details] [review]
Backport for 3.5 (and may be 3.6)

For anyone still using or interested in kernel 3.5 or 3.6, can you try/verify the attached patch? I checked it and hope that it works; it ended up being a simpler diff.
Comment 54 Herton Krzesinski 2013-01-23 04:36:37 UTC
Created attachment 73488 [details] [review]
Backport for 3.5 (and may be 3.6) - fixed

Sorry, the patch I posted had a small typo, this one is ok...
Comment 55 Ronald 2013-01-23 08:04:41 UTC
I will try to test this weekend, cannot promise anything though.
Comment 56 Ronald 2013-01-23 15:31:58 UTC
Spot on! It worked :)

Just found a small window to test:

- 3.5.7+patch(v2): success! 5 cycles, clean dmesg
- 3.6.9+patch(v2): success! 5 cycles, clean dmesg

Backporting this will make a lot of people happy. Thank you.
Comment 57 Marcin Slusarz 2013-01-23 17:24:35 UTC
This backported patch will crash on nv10-nv17 - priv->bo is only created for chipsets >= nv17.
Comment 58 Herton Krzesinski 2013-01-24 17:54:14 UTC
Created attachment 73597 [details] [review]
Backport for 3.5 (and may be 3.6) - fixed again

argh, thanks. This is the fixed version.
Comment 59 Mr-4 2013-01-31 00:46:23 UTC
(In reply to comment #58)
> Created attachment 73597 [details] [review]
> Backport for 3.5 (and may be 3.6) - fixed again
I have to say, I didn't expect for this to be fixed, but this time it looks as though this nasty bug is being squashed. I have been doing hibernate/restore like mad (a couple of times a day) for the past week and not a single glitch, not one!

My one-and-only gripe with this (and it really is a minor one, given the circumstances) is that since I used a "fork" of the mainstream driver from Martin Peres (mupuf) at gitorious (http://gitorious.org/linux-nouveau-pm/linux-nouveau-pm/commits/thermal) to take advantage of the automatic fan management so that I don't have to execute "echo 40 > /sys/class/drm/card0/device/pwm0" every time I come out of hibernation and stop my nvidia fan wailing like a jet engine, this is still not implemented in the main stream driver code.

I used to have this so that when I come out of hibernate the automatic fan management takes over and reduces my fan speed according to the current temperature of the video card. It worked a treat, but I don't seem to be getting this any more and applying the patch which fixed the above bug is not possible against Martin's fork, unfortunately.
Comment 60 Ronald 2013-01-31 08:04:41 UTC
(In reply to comment #59)

You can put the following line:

"echo 40 > /sys/class/drm/card0/device/pwm0"

In a file in

/usr/lib/pm-utils/sleep.d/

Look at the other files there for the format and arguments. Make sure you have the permissions right (root:root 700).
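
For illustration, a minimal sketch of such a hook, assuming pm-utils passes the action (suspend, hibernate, resume or thaw) as the first argument. The file name 99-nouveau-fan, the set_fan function and the PWM_NODE/PWM_VALUE variables are made up for this example; only the sysfs path and the value 40 come from comment #59.

```sh
#!/bin/sh
# Hypothetical hook, e.g. /usr/lib/pm-utils/sleep.d/99-nouveau-fan
# pm-utils runs each hook with the action as $1:
#   suspend | hibernate on the way down, resume | thaw on the way up.

# Both variables can be overridden, so the hook can be dry-run against
# a scratch file. The defaults are the path and value from comment #59.
PWM_NODE="${PWM_NODE:-/sys/class/drm/card0/device/pwm0}"
PWM_VALUE="${PWM_VALUE:-40}"

set_fan() {
    case "$1" in
        resume|thaw)
            # Only write when the node exists and is writable.
            if [ -w "$PWM_NODE" ]; then
                echo "$PWM_VALUE" > "$PWM_NODE"
            fi
            ;;
    esac
    # Never fail the sleep/wake sequence because of the fan setting.
    return 0
}

set_fan "${1:-}"
```

Restricting the write to resume and thaw avoids touching the fan on the way into suspend or hibernate.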
Comment 61 Mr-4 2013-01-31 13:36:51 UTC
(In reply to comment #60)
> You can put the following line:
> 
> "echo 40 > /sys/class/drm/card0/device/pwm0"
> 
> In a file in
> 
> /usr/lib/pm-utils/sleep.d/
Thanks Ronald, I'll try that, though, ideally, I would have liked to get my automatic fan management back - it was absolutely flawless. I don't know why it was discarded from the mainstream tree.
Comment 62 francoism 2013-02-16 17:09:18 UTC
Sorry for not replying (for a long time).

I'm still having this issue on resume (it gives RGB stripes and a freeze).

The following packages are installed from the Arch Linux repo:
nouveau-dri 9.0.2-1
xf86-video-nouveau 1.0.6-1
linux 3.7.8-1

I have not tried the GIT-version in AUR.

The GPU is an EVGA nVidia GeForce GTS 450 (1GB RAM).
Comment 63 Ronald 2013-02-16 17:14:59 UTC
Did you apply the patch from comment #36?

https://bugs.freedesktop.org/show_bug.cgi?id=50121#c36

Only v3.8 carries the fix.
Comment 64 Mr-4 2013-02-16 17:45:47 UTC
(In reply to comment #63)
> Only v3.8 carries the fix.
Nope! My own distro (Fedora, as it turns out) had this patch already in their main nv tree as of kernel 3.7.4, so there was no need for me to apply it separately - I just built the kernel and had zero problems with my nv card ever since.
Comment 65 Petr Stastny 2013-02-19 08:55:27 UTC
(In reply to comment #44)
I am experiencing the same problem on my W520 with Nvidia GF106 [Quadro 2000M]. I get the same symptoms / messages. Did you find a fix for the problem?

I noticed that suspend / resume works if using libdrm-nouveau1a only (without libdrm-nouveau2); however, the mouse gets lost on resume. If using libdrm-nouveau2, I get the problems. Maybe the problem is somehow related to dri2?

> Am I still in the scope of this bug with the following endlessly repeating
> errors on resume? (Which will go away if killing X.)
> 
> [ 1623.833303] nouveau E[  PGRAPH][0000:01:00.0] TRAP ch 1 [0x007fe00000
> Xorg[1009]]
> [ 1623.833311] nouveau E[  PGRAPH][0000:01:00.0] SHADER 0xa004021e
Comment 66 michael.weirauch 2013-02-19 09:12:09 UTC
(In reply to comment #65)
> (In reply to comment #44)
> I am experiencing the same problem on my W520 with Nvidia GF106 [Quadro
> 2000M]. I get the same symptoms / messages. Did you find a fix for the
> problem?
> 
> I noticed that suspend / resume works if using libdrm-nouveau1a only
> (without libdrm-nouveau2); however, the mouse gets lost on resume. If using
> libdrm-nouveau2, I get the problems. Maybe the problem is somehow related
> to dri2?
> 
> > Am I still in the scope of this bug with the following endlessly repeating
> > errors on resume? (Which will go away if killing X.)
> > 
> > [ 1623.833303] nouveau E[  PGRAPH][0000:01:00.0] TRAP ch 1 [0x007fe00000
> > Xorg[1009]]
> > [ 1623.833311] nouveau E[  PGRAPH][0000:01:00.0] SHADER 0xa004021e

Very interesting comments. Please attach this info again to bug 59168 where the W520 lovers hang around regarding this issue.
Comment 67 Mr-4 2013-02-25 18:24:28 UTC
Here we go again...

After experiencing the joy and the meaning of "stability" and what it feels like, I decided to upgrade to kernel 3.7.9 (the latest stable) from 3.7.4 and voila... this bug reared its ugly head again after the first hibernate/restore cycle.

This is what I get:

kernel: PM: Syncing filesystems ... done.
kernel: Freezing user space processes ... (elapsed 0.01 seconds) done.
kernel: PM: Preallocating image memory... done (allocated 220534 pages)
kernel: PM: Allocated 882136 kbytes in 0.25 seconds (3528.54 MB/s)
kernel: Freezing remaining freezable tasks ... (elapsed 3.29 seconds) done.
kernel: Suspending console(s) (use no_console_suspend to debug)
kernel: sd 2:0:0:0: [sda] Synchronizing SCSI cache
kernel: i8042 kbd 00:0a: wake-up capability enabled by ACPI
kernel: mpu401 00:05: disabled
kernel: nouveau  [     DRM] suspending fbcon...
kernel: nouveau  [     DRM] suspending display...
kernel: nouveau  [     DRM] unpinning framebuffer(s)...
kernel: nouveau  [     DRM] evicting buffers...
kernel: pciehp 0000:00:02.0:pcie04: pciehp_suspend ENTRY
kernel: agpgart-via 0000:00:00.0: Refused to change power state, currently in D0
kernel: pci 0000:00:13.1: wake-up capability enabled by ACPI
kernel: nouveau  [     DRM] suspending client object trees...
kernel: PM: freeze of devices complete after 343.943 msecs
kernel: PM: late freeze of devices complete after 0.310 msecs
kernel: PM: noirq freeze of devices complete after 0.488 msecs
kernel: ACPI: Preparing to enter system sleep state S4
kernel: PM: Saving platform NVS memory
kernel: Disabling non-boot CPUs ...
kernel: Broke affinity for irq 1
kernel: Broke affinity for irq 16
kernel: smpboot: CPU 1 is now offline
kernel: PM: Creating hibernation image:
kernel: PM: Need to copy 169979 pages
kernel: PM: Restoring platform NVS memory
kernel: Enabling non-boot CPUs ...
kernel: smpboot: Booting Node 0 Processor 1 APIC 0x1
kernel: CPU1 is up
kernel: ACPI: Waking up from system sleep state S4
kernel: PM: noirq restore of devices complete after 22.507 msecs
kernel: PM: early restore of devices complete after 0.121 msecs
kernel: pciehp 0000:00:02.0:pcie04: pciehp_resume ENTRY
kernel: nouveau  [     DRM] re-enabling device...
kernel: nouveau  [     DRM] resuming client object trees...
kernel: nouveau  [   VBIOS][0000:01:00.0] running init tables
kernel: pci 0000:00:13.1: wake-up capability disabled by ACPI
kernel: mpu401 00:05: activated
kernel: i8042 kbd 00:0a: wake-up capability disabled by ACPI
kernel: agpgart-via 0000:00:00.0: AGP 3.5 bridge
kernel: agpgart: kworker/u:1 tried to set rate=x12. Setting to AGP3 x8 mode.
kernel: agpgart-via 0000:00:00.0: putting AGP V3 device into 8x mode
kernel: nouveau 0000:01:00.0: putting AGP V3 device into 8x mode
kernel: nouveau  [     DRM] resuming display...
kernel: nouveau  [   PFIFO][0000:01:00.0] unknown intr 0x00010000, ch 0
[...ad infinitum...]
kernel: nouveau  [   PFIFO][0000:01:00.0] still angry after 101 spins, halt
kernel: nouveau  [     DRM] 0xD3FB: Parsing digital output script table
kernel: nouveau  [     DRM] Setting dpms mode 3 on TV encoder (output 3)
kernel: nouveau  [     DRM] 0xD3FB: Parsing digital output script table
abrt[3701]: not dumping repeating crash in '/usr/bin/Xorg'
kernel: nouveau E[    3644] failed to idle channel 0xcccc0000
abrt[3716]: saved core dump of pid 3710 (/usr/bin/Xorg) to /var/spool/abrt/ccpp-1361815666-3710.new/coredump (3674112 bytes)
abrtd: Directory 'ccpp-1361815666-3710' creation detected
abrtd: Crash is in database already (dup of /var/spool/abrt/ccpp-1361815646-2078)
abrtd: Deleting crash ccpp-1361815666-3710 (dup of ccpp-1361815646-2078), sending dbus signal
kernel: nouveau E[    3710] failed to idle channel 0xcccc0000
abrt[3726]: not dumping repeating crash in '/usr/bin/Xorg'
kernel: nouveau E[    3720] failed to idle channel 0xcccc0000
[...the last two messages repeat ad-infinitum...]

So, I think that whatever was done in the nvidia kernel code after 3.7.4 (I haven't tested any version between 3.7.4 and 3.7.9) brought that nasty bug back to life, and I think it should be reopened, seeing that I am not the only one experiencing this...
Comment 68 Ronald 2013-02-25 19:40:05 UTC
Is it consistent, git says 'no':

ronald@Alpha /usr/src/linux :) $ git log --oneline --no-merges v3.7.4...v3.7.9 -- drivers/gpu/drm
330b8ab drm/nouveau: add lockdep annotations
c7fc196 drm/radeon: Calling object_unrefer() when creating fb failure
b666341 drm/radeon: prevent crash in the ring space allocation
3e15c0b drm/radeon: protect against div by 0 in backend setup
f69c00e drm/radeon: fix backend map setup on 1 RB sumo boards
cb23301 drm/radeon: fix MC blackout on evergreen+
04589e1 drm/radeon: add quirk for RV100 board
4e370f5 drm/radeon: add WAIT_UNTIL to the non-VM safe regs list for cayman/TN
83f83a0 drm/radeon/evergreen+: wait for the MC to settle after MC blackout
ec8a7ca drm/i915: fix FORCEWAKE posting reads
710c8f5 drm/radeon: fix a rare case of double kfree
c140fec drm/radeon: fix error path in kpage allocation
72b56e8 efi: Make 'efi_enabled' a function to query EFI facilities
8a9d24b drm/i915: dump UTS_RELEASE into the error_state
36ce28c drm/i915: GFX_MODE Flush TLB Invalidate Mode must be '1' for scanline waits
924782c drm/i915: Disable AsyncFlip performance optimisations
18c8e49 radeon_display: Use pointer return error codes
45bced1 drm/radeon: fix cursor corruption on DCE6 and newer
dd1b4df drm/i915: Implement WaDisableHiZPlanesWhenMSAAEnabled
5afeb70 drm/i915: Invalidate the relocation presumed_offsets along the slow path

Don't think the lockdep annotations do anything that changes the code. But you might want to give it a try...
Comment 69 Ronald 2013-02-25 19:42:42 UTC
Whoops, typo in first sentence. What I meant was:

Is the crash consistent? Does it happen every time? The fix should already be included as of v3.7.3:

ronald@Alpha /usr/src/linux :) $ git log --oneline --no-merges v3.7^...v3.7.9 -- drivers/gpu/drm/nouveau Makefile
5b7be63 Linux 3.7.9
7773647 Linux 3.7.8
330b8ab drm/nouveau: add lockdep annotations
89c5f13 Linux 3.7.7
07c4ee0 Linux 3.7.6
13280f4 Linux 3.7.5
8380f1a arm64: makefile: fix uname munging when setting ARCH on native machine
8a69ca2 Linux 3.7.4
078314e Linux 3.7.3
39c5cda drm/nvc0/fb: fix crash when different mutex is used to protect same list
c6f94c3 drm/nouveau/clock: fix support for more than 2 monitors on nve0
7fbc316 drm/nouveau: add locking around instobj list operations
4d64387 drm/nouveau: fix blank LVDS screen regression on pre-nv50 cards
ad30b29 drm/nv17-50: restore fence buffer on resume
2370a21 drm/prime: drop reference on imported dma-buf come from gem
cfdfb8f drm/nouveau: fix init with agpgart-uninorth
a77af8b kbuild: Do not remove vmlinux when cleaning external module
e6577f3 Linux 3.7.2
cc86050 Linux 3.7.1
2959440 Linux 3.7

Yes, I'm showing off my newly attained git skillz =P
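
A related check, sketched here under the assumption that a reasonably recent git is available: rather than reading the log by eye, one can ask git for the lowest-versioned tag that contains a given commit. first_tag_containing is a hypothetical helper name, not part of git.

```sh
# Print the lowest-versioned tag that contains the given commit.
#   $1 = commit id (for example ad30b29 from the log above)
#   $2 = path to the repository (defaults to the current directory)
first_tag_containing() {
    # --contains keeps only tags that have the commit in their history;
    # --sort=version:refname orders them as version numbers, so that
    # v3.7.3 sorts before v3.7.10.
    git -C "${2:-.}" tag --sort=version:refname --contains "$1" | head -n 1
}
```

It would be called as, for instance, "first_tag_containing ad30b29 /usr/src/linux". An empty result means no tag in that tree contains the commit.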
Comment 70 Mr-4 2013-02-27 03:17:42 UTC
(In reply to comment #69)
> Is the crash consistent? Does it happen every time? The fix should already be
> included as of v3.7.3:
No.

I've downgraded the kernel slightly - from 3.7.9-201 to 3.7.9-101 (both official releases from Fedora) and after the first hibernate/restore cycle I get:

nouveau  [   PFIFO][0000:01:00.0] DMA_PUSHER - Ch 0 Get 0x00000004 Put 0x000000c0 State 0x80000720 (err: INVALID_CMD) Push 0x00000000


However, the restore seems to work OK - I have no side effects after it is done, apart from the above message, so this might be an indication of what goes wrong. I don't know, but I thought I'd post it here.
Comment 71 Ronald 2013-02-27 05:32:01 UTC
You should test with a vanilla kernel to rule out distro-specific changes to the kernel. Furthermore, maybe test whether the above patch introduces this regression?

Whatever you do, I suggest you open a new bug for this to get some more exposure.
Comment 72 Mr-4 2013-04-30 00:48:59 UTC
Things have been pretty stable with 3.8.1, but it is getting a bit "twitchy" with 3.8.7 - on two separate occasions I've got the following error on restore:

kernel: nouveau  [     DRM] re-enabling device...
kernel: nouveau  [     DRM] resuming client object trees...
kernel: nouveau  [   VBIOS][0000:01:00.0] running init tables
kernel: nouveau  [     DRM] resuming display...
kernel: nouveau W[   PFIFO][0000:01:00.0] unknown intr 0x00010000, ch 0
kernel: nouveau W[   PFIFO][0000:01:00.0] unknown intr 0x00010000, ch 0
kernel: nouveau W[   PFIFO][0000:01:00.0] unknown intr 0x00010000, ch 0
[...]
kernel: nouveau E[   PFIFO][0000:01:00.0] still angry after 101 spins, halt
kernel: nouveau  [     DRM] 0xD3FB: Parsing digital output script table
kernel: nouveau  [     DRM] Setting dpms mode 3 on TV encoder (output 3)
kernel: nouveau  [     DRM] 0xD3FB: Parsing digital output script table


at which point the whole screen fills with the now-familiar black-and-white rectangles and I have no option but to reboot.

The above error does not happen very often - maybe every 8-10 hibernate/resume cycles, but with the 3.8.1 kernel I didn't have any errors at all.
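
Since the failure is intermittent, a small check over a saved kernel log may help keep track of which cycles went wrong. This is only a sketch: resume_failed and count_unknown_intr are made-up helper names, and the patterns are taken from the messages quoted in this report, so they may not match other kernel versions.

```sh
# Succeeds (exit 0) if the log shows the PFIFO failure signature
# quoted in this report.
#   $1 = path to a saved kernel log (e.g. the output of dmesg)
resume_failed() {
    grep -q -e 'still angry after [0-9]* spins' \
            -e 'failed to idle channel' "$1"
}

# Prints how many "unknown intr" lines PFIFO logged.
#   $1 = path to a saved kernel log
count_unknown_intr() {
    grep -c 'PFIFO.*unknown intr' "$1"
}
```

Saving the log after each restore (for example with "dmesg > resume.log") and running resume_failed on it would show which cycles hit the error.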
Comment 73 Ronald 2013-04-30 07:57:04 UTC
Assuming it's nouveau specific and not drm:

2705de0 drm/nouveau: fix handling empty channel list in ioctl's

is between 3.8.6 and 3.8.7. However, there are some generic DRM changes as well:

faec22f KMS: fix EDID detailed timing frame rate
eaa1a61 KMS: fix EDID detailed timing vsync parsing

To be honest, my card is wonky too. Sometimes the screen goes black and it oopses (I think). But I think some of it is also old age...
