Bug 95031

Summary: [NVE4] 660 Ti Random GPU lockups
Product: xorg
Reporter: Lucas Ribeiro <lucasout>
Component: Driver/nouveau
Assignee: Nouveau Project <nouveau>
Status: RESOLVED FIXED
QA Contact: Xorg Project Team <xorg-team>
Severity: normal    
Priority: medium    
Version: unspecified   
Hardware: Other   
OS: All   
Whiteboard:
Attachments:
  Description                      Flags
  nouveau bugs                     none
  another log with different info  none
  vbios                            none
  kernel log 3                     none

Description Lucas Ribeiro 2016-04-20 07:44:15 UTC
Created attachment 123085 [details]
nouveau bugs

I'm getting random lockups on a GTX 660 Ti (NVE4) using DRI2, since kernel 4.1 I guess.
[    0.267666] nouveau 0000:02:00.0: NVIDIA GK104 (0e4030a2)
[    0.378583] nouveau 0000:02:00.0: bios: version 80.04.4b.00.1a
[    0.379302] nouveau 0000:02:00.0: fb: 2048 MiB GDDR5

Now on gentoo ~amd64 using:

sys-kernel/gentoo-sources-4.5.1
x11-base/xorg-server-1.18.3
x11-drivers/xf86-video-nouveau-1.0.12

Also tried Karol Herbst's reclocking branch v4 (https://github.com/karolherbst/nouveau/tree/stable_reclocking_kepler_v4), reclocked to pstate 07 and tried all 3 boost states. All of them hang sooner or later.
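
For anyone reproducing the reclocking step, here is a minimal sketch of requesting a pstate through nouveau's debugfs file. It assumes debugfs is mounted at /sys/kernel/debug and that the card is DRM device 0; the helper names `request_pstate` and `read_pstates` are hypothetical, and the boost setting from the reclocking branch is not covered here.

```python
from pathlib import Path

# Assumption: debugfs is mounted at /sys/kernel/debug and the card is DRM device 0.
PSTATE_FILE = Path("/sys/kernel/debug/dri/0/pstate")


def request_pstate(pstate_id, pstate_file=PSTATE_FILE):
    """Write a pstate id such as '07' or '0f' to the given file (hypothetical helper)."""
    pstate_file.write_text(f"{pstate_id}\n")


def read_pstates(pstate_file=PSTATE_FILE):
    """Return the non-empty lines of the given pstate file."""
    return [line for line in pstate_file.read_text().splitlines() if line.strip()]


# Example use (needs root, so it is not executed here):
#   request_pstate("07")
#   print("\n".join(read_pstates()))
```

Both helpers take the file path as a parameter so they can be tried against an ordinary file before touching the real debugfs entry.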

Will continue testing other pstates and boost configurations.

Sometimes the kernel log becomes corrupted, but I managed to get a working log (attached).
Comment 1 Lucas Ribeiro 2016-04-20 07:52:04 UTC
Forgot to add: I did not try earlier kernels, so this behaviour may have existed ever since the card was first supported. On Windows it works well.

It hangs during normal browsing or when opening a video in VLC.
Comment 2 Lucas Ribeiro 2016-04-20 20:54:23 UTC
Different kernel log with some nouveau info.

So far tried pstate 07 with boost 0, 1 and 2.
Comment 3 Lucas Ribeiro 2016-04-20 20:54:49 UTC
Created attachment 123098 [details]
another log with different info
Comment 4 Lucas Ribeiro 2016-04-21 03:12:30 UTC
OK, some interesting findings.

07 pstate on Linux has:
core 324 MHz memory 648 MHz AC DC
GPU core:     +0.99 V
Increasing GPU core voltage to 1.09V has yielded a stable system so far.

On Windows, the idle state has (checked with gpu-z):
core 324 MHz memory 162MHz
GPU core voltage: 0.99V
It has never managed to hang.

So maybe 07 pstate on Linux has a memory clock too high for its voltage. Also, as it is the lowest pstate, maybe memory clock could be reduced further to 162MHz (as in Windows).
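
A minimal sketch for putting the two sets of clocks quoted in this comment side by side. The helper name `parse_clocks` is hypothetical, and the regular expression only assumes the "core N MHz memory M MHz" form as written here, not the driver's full pstate output format.

```python
import re

# Matches "core 324 MHz" or "memory 162MHz" (space before MHz optional).
CLOCK_RE = re.compile(r"(core|memory)\s+(\d+)\s*MHz")


def parse_clocks(line):
    """Return {'core': MHz, 'memory': MHz} from a line as quoted in this comment."""
    return {name: int(mhz) for name, mhz in CLOCK_RE.findall(line)}


linux_07 = parse_clocks("core 324 MHz memory 648 MHz AC DC")
windows_idle = parse_clocks("core 324 MHz memory 162MHz")

for domain in ("core", "memory"):
    ratio = linux_07[domain] / windows_idle[domain]
    print(f"{domain}: Linux {linux_07[domain]} MHz, "
          f"Windows {windows_idle[domain]} MHz, ratio {ratio:g}")
```

The core clocks match and the reported memory clocks differ by a factor of 4. This report does not establish whether the two tools count the memory clock the same way, so the factor alone does not show that Linux runs the memory faster.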
Comment 5 Lucas Ribeiro 2016-04-21 04:38:46 UTC
Created attachment 123105 [details]
vbios

Yea, finally managed to hang the system at 07 pstate with 1.09V (+0.1V), with a corrupted kernel log as well. I dunno what else to do. I'm attaching the vbios.
Comment 6 Lucas Ribeiro 2016-04-21 06:23:51 UTC
Created attachment 123106 [details]
kernel log 3

This time running pstate 0f with +0.1V, total 1.15V. Hangs, corrupts kernel log and then starts flooding it with a different error.
Comment 7 Karol Herbst 2016-04-21 08:19:41 UTC
(In reply to Lucas Ribeiro from comment #4)
> 
> So maybe 07 pstate on Linux has a memory clock too high for its voltage.
> Also, as it is the lowest pstate, maybe memory clock could be reduced
> further to 162MHz (as in Windows).

And maybe not. There are other issues which aren't exactly voltage related. If such a high voltage won't help, then it is usually something else; we just have to figure out what it is.

Also regarding the lower clocks: yeah I know that sometimes nvidia clocks further down, but there is no real value in doing so if there is no voltage information for those low clocks and it doesn't make any difference regarding power consumption as far as I know.
Comment 8 Lucas Ribeiro 2016-04-21 17:15:12 UTC
(In reply to Karol Herbst from comment #7)
> (In reply to Lucas Ribeiro from comment #4)
> > 
> > So maybe 07 pstate on Linux has a memory clock too high for its voltage.
> > Also, as it is the lowest pstate, maybe memory clock could be reduced
> > further to 162MHz (as in Windows).
> 
> And maybe not. There are other issues which aren't exactly voltage related.
> If such a high voltage won't help, then it is usually something else, we
> just have to figure out what it is.
> 
> Also regarding the lower clocks: yeah I know that sometimes nvidia clocks
> further down, but there is no real value in doing so if there is no voltage
> information for those low clocks and it doesn't make any difference
> regarding power consumption as far as I know.

Thanks for clearing that up.

I'm out of ideas. Should I capture an mmiotrace?
Comment 9 Lucas Ribeiro 2016-05-26 23:43:17 UTC
Running kernel 4.6 has improved things. I don't know what changed in the driver, but I have yet to see a lockup on this card. No out-of-tree patches applied.

Will post again if I experience a freeze.

Thanks!
