Bug 45092

Summary: [965GM] broken swizzling in swap-in/out paths/L-shaped memory swizzling
Product: DRI Reporter: Johnny Wezel <freedesktop-jay>
Component: DRM/Intel Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: low CC: chris, daniel, ildar, jbarnes, lohmaier, marius, ysangkok
Version: unspecified   
Hardware: x86 (IA32)   
OS: Linux (All)   
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=14544
Whiteboard:
i915 platform: i915 features:
Attachments:
Description (Flags)
Typical appearance of garbage (none)
X log (none)
dmesg (none)
Output of xrandr --verbose (none)
Damaged desktop (none)
Output of intel_reg_dumper (none)
Damaged GTK+ icons (none)
Hack to prevent movement of swizzled pages (none)
Corrupted desktop (none)

Description Johnny Wezel 2012-01-22 08:43:53 UTC
Created attachment 55974 [details]
Typical appearance of garbage

After running memory-intensive programs I often get garbled UIs, most often in GTK+ programs. I started to suspect that memory read back from disk causes this, so I ran a little test that forced a lot of memory out to disk, and the artifacts showed up. Someone else on gentoo.org said he had exactly the same sort of artifacts after hibernating to disk.

System environment: 
-- chipset: "Intel 965GM"
-- system architecture: i686 32-bit
-- xf86-video-intel: "2.17.0"
-- xserver: "X.Org X Server 1.11.2"
-- mesa: 7.11.2
-- libdrm: 2.4.27
-- kernel: "Linux beluga 3.1.6-gentoo #1 SMP PREEMPT Sun Jan 22 00:29:26 CET 2012 i686 Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz GenuineIntel GNU/Linux"
-- Linux distribution: gentoo
-- Machine or mobo model: i686 Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz GenuineIntel
-- Display connector: LVDS1

Unable to build intel-gpu-tools due to xorg version mismatch.
Comment 1 Johnny Wezel 2012-01-22 08:45:23 UTC
Created attachment 55975 [details]
X log
Comment 2 Johnny Wezel 2012-01-22 08:47:51 UTC
Created attachment 55976 [details]
dmesg
Comment 3 Johnny Wezel 2012-01-22 08:48:54 UTC
Created attachment 55977 [details]
Output of xrandr --verbose
Comment 4 Chris Wilson 2012-01-22 08:55:14 UTC
Possibly related bug 28813. Or we may just have never got the swizzling correct for crestline?

Johnny, can you please download intel-gpu-tools from http://cgit.freedesktop.org/xorg/app/intel-gpu-tools/ and run make test? We may, however, need to throw in a memory hog in order to exercise the swap paths.
Comment 5 Daniel Vetter 2012-01-22 12:16:01 UTC
Ok, a few things:
- Can you try to grab another screenshot, preferably one where the background image (or any other large image) shows some corruption? Because GUI elements are usually pretty small and have mostly uniform, flat coloring, it's much harder to see the pattern in them.

- Please attach the output of intel_reg_dumper from intel-gpu-tools (at least v1.1, preferably git). You need to install a bunch of dependencies for that to compile.

- As Chris mentioned, please run the i-g-t testsuite. If that one blows up, the relevant tests are tests/gem_tiled_*, you can run these manually.
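The "memory hog" Chris mentions could look roughly like the following. This is only a minimal sketch and not part of i-g-t: the function name hog_memory and its parameters are made up for illustration. It allocates memory in chunks and writes one byte per page, so the kernel has to back every page and, once RAM runs short, starts swapping.

```python
import mmap

PAGE_SIZE = mmap.PAGESIZE  # page size reported by the running system


def hog_memory(total_bytes, chunk_bytes=64 * 1024 * 1024):
    """Allocate about total_bytes and touch every page.

    Hypothetical helper: returns the list of buffers so the caller can
    keep them alive for as long as memory pressure is wanted.
    """
    chunks = []
    remaining = total_bytes
    while remaining > 0:
        size = min(chunk_bytes, remaining)
        chunk = bytearray(size)
        # Write one byte per page so each page is really allocated,
        # not just reserved.
        for offset in range(0, size, PAGE_SIZE):
            chunk[offset] = 0xA5
        chunks.append(chunk)
        remaining -= size
    return chunks
```

Calling it with a size somewhat larger than the installed RAM, and holding on to the returned list while the tests/gem_tiled_* tests run, should push graphics buffers out to swap.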
Comment 6 Johnny Wezel 2012-01-22 15:50:46 UTC
Created attachment 56003 [details]
Damaged desktop
Comment 7 Johnny Wezel 2012-01-22 16:30:53 UTC
I have uploaded a shot of the desktop with damages.

As I already wrote, I can't build intel-gpu-tools because of an error. Here's the output of autogen.sh:

autoreconf-2.68: Entering directory `.'
autoreconf-2.68: configure.ac: not using Gettext
autoreconf-2.68: running: aclocal 
configure.ac:52: error: xorg-macros version 1.16 or higher is required but 1.15.0 found
/usr/share/aclocal/xorg-macros.m4:39: XORG_MACROS_VERSION is expanded from...
configure.ac:52: the top level
autom4te-2.68: /usr/bin/m4 failed with exit status: 1
aclocal-1.11: /usr/bin/autom4te-2.68 failed with exit status: 1
autoreconf-2.68: aclocal failed with exit status: 1

I guess to succeed, I have to have a newer util-macros package for which I have to have a newer X server. I don't know whether I want to do that. Can't we get around without it?
Comment 8 Daniel Vetter 2012-01-23 01:21:14 UTC
> --- Comment #7 from Johnny Wezel <freedesktop-
> I guess to succeed, I have to have a newer util-macros package for which I have
> to have a newer X server. I don't know whether I want to do that. Can't we get
> around without it?

Nope, newer X server not required at all. Either you upgrade the dev
package (xutils-dev on debian) or you grab the single missing file
from source:

http://cgit.freedesktop.org/xorg/util/macros/

Run ./autogen.sh and copy xorg-macros.m4 into /usr/share/aclocal

That should make this work. Btw for quick questions like this it's
usually faster to ask for help on irc, #intel-gfx on the freenode
network.
Comment 9 Johnny Wezel 2012-01-23 09:13:27 UTC
Created attachment 56044 [details]
Output of intel_reg_dumper
Comment 10 Johnny Wezel 2012-01-23 09:14:45 UTC
OK, got it with intel_reg_dumper (had to update libdrm and check whether the bug is still there because of that [yes, it is])
Comment 11 Daniel Vetter 2012-01-23 09:46:52 UTC
Thanks a lot for the reg dump and the screenshot, perfect match with what Chris suspected. Can you please also grab the same registers as in https://bugs.freedesktop.org/show_bug.cgi?id=28813#c21
Comment 12 Johnny Wezel 2012-01-23 14:44:28 UTC
Sure:

0x10200 : 0xF0002
0x10204 : 0x0
0x100E0 : 0x0
0x11234 : 0x910C1800
0x11334 : 0x910C1800
Comment 13 Daniel Vetter 2012-01-25 06:37:28 UTC
Can you try to set bit20 in register 0x10204 like this?

intel_reg_write 0x10204 0x100000

Note though that it's unclear from the documentation what this bit exactly does, and it has the potential to corrupt system memory (and not just graphics stuff). So I highly advise you to try this on a throw-away disk/installation.

But it's the only thing I could find, so please try it if you can.
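For reference, the value 0x100000 is simply bit 20 on its own. The sketch below shows how the value is derived and how a read-back can be checked; the helper names are illustrative and it does not touch any hardware.

```python
def set_bit(value, bit):
    """Return value with the given bit number set."""
    return value | (1 << bit)


def bit_is_set(value, bit):
    """Return True if the given bit number is set in value."""
    return bool(value & (1 << bit))


before = 0x0                    # register 0x10204 as reported in comment 12
wanted = set_bit(before, 20)
print(hex(wanted))              # 0x100000
```

If the value read back after the write still has bit 20 clear, i.e. bit_is_set(readback, 20) is False, the write did not stick.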
Comment 14 Johnny Wezel 2012-01-25 10:10:38 UTC
This is the output of the command:

Value before: 0x0
Value after: 0x0

There is no effect from the command. Problems persist.

I'm not sure whether this helps, but another effect of swapped-back memory is that icons in GTK+ programs lose their images, as shown in the last screenshot.
Comment 15 Johnny Wezel 2012-01-25 10:11:56 UTC
Created attachment 56154 [details]
Damaged GTK+ icons

There is no way to make the icons' images reappear.
Comment 16 Daniel Vetter 2012-01-26 03:12:27 UTC
Ok, so the hw doesn't allow this bit to be flipped after initialization. Which makes sense. I need to do more documentation reading and also some patch writing before I'll have something new for you to test.

For the damaged icons: Does this not happen when swap is disabled?
Comment 17 Chris Wilson 2012-01-26 03:30:26 UTC
The icons look more like the ddx bug:

commit 2174f840158aa9cfa370ade38be28f8dc8e4b526
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Nov 3 20:41:31 2011 +0000

    uxa: Remove caching of surface binding location
    
    If the pixmap were to be used multiple times within a batch with
    mulitple formats, the cache would only return the initial location with
    the incorrect format and so cause rendering glitches. For instance, GTK+
    uses the same pixmap as an xrgb source and as an argb mask in order to
    premultiply and composite in a single pass. Rather than introduce an
    overly complication caching (handle, format) mechanism, kiss and remove
    the invalid implementation.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=40926
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

released in 2.16.902
Comment 18 Johnny Wezel 2012-01-26 10:40:30 UTC
The icons blacken only after swapping. IMHO this is not a GTK+ bug.
Comment 19 Daniel Vetter 2012-02-20 03:58:09 UTC
*** Bug 46178 has been marked as a duplicate of this bug. ***
Comment 20 Marius Gedminas 2012-04-11 07:33:19 UTC
Got this again, under GNOME Shell, after using suspend to disk.  (I've a screenshot if you need more of them.)  Restarting GNOME Shell with Alt-F2 r fixed the corruption.

Is there anything I can do to help debug this?
Comment 21 Daniel Vetter 2012-04-11 07:35:41 UTC
(In reply to comment #20)
> Got this again, under GNOME Shell, after using suspend to disk.  (I've a
> screenshot if you need more of them.)  Restarting GNOME Shell with Alt-F2 r
> fixed the corruption.
> 
> Is there anything I can to do help debug this?

Unfortunately not. I have a machine which has this issue, too. And we have tests in i-g-t that can easily reproduce it. The problem is simply that I have no idea how to fix it (without rewriting the entire driver, that is).
Comment 22 Chris Wilson 2012-04-11 07:39:45 UTC
What's the grand plan? Everyone needs to be aware of physical page locations and swizzling again?
Comment 23 Daniel Vetter 2012-04-11 07:58:29 UTC
One thing you could try is to grab the latest version of intel-gpu-tools from git and run the testsuite with make test. That should give you a nice set of failing tests with "tiled" in their names.

Then try to decrease the amount of memory linux uses with the mem=xxxm boot parameter, until all these tests with 'tiled' work reliably. The best result would be to figure out that things work for mem=xxxm, but not for mem=xxx+1m.
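The search for that boundary is a bisection. Here is a minimal sketch, assuming the tests pass for every value up to some threshold and fail above it. The tests_pass callback is hypothetical: in practice each call stands for a reboot with mem=<value>m followed by a manual run of the tiled tests.

```python
def find_mem_threshold(tests_pass, low_mb, high_mb):
    """Return the largest mem= value (in MB) for which tests_pass is True.

    Assumes monotonic behaviour: passing at low_mb, and once the tests
    fail at some size they also fail at every larger size.
    """
    if not tests_pass(low_mb):
        raise ValueError("tests already fail at the lower bound")
    if tests_pass(high_mb):
        return high_mb  # nothing fails in the given range
    # Invariant: tests pass at low_mb and fail at high_mb.
    while high_mb - low_mb > 1:
        mid = (low_mb + high_mb) // 2
        if tests_pass(mid):
            low_mb = mid
        else:
            high_mb = mid
    return low_mb
```

With a range of a few gigabytes this needs about a dozen reboots instead of trying every value.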
Comment 24 Chris Wilson 2013-07-03 17:54:27 UTC
Created attachment 81971 [details] [review]
Hack to prevent movement of swizzled pages

A hack for you to please test.
Comment 25 Jani Nikula 2013-12-17 11:39:43 UTC
Timeout. What is the status of the bug?

Johnny, still an issue? Did you try the proposed hack patch?
Comment 26 Abdelfetah Hadij 2014-02-28 20:56:07 UTC
Hi, I had the same problem on Debian Wheezy and managed to solve it by changing my hibernation method to TuxOnIce.

I hope this extra info helps solve the problem.
Comment 27 Ildar Muyukov 2014-03-19 11:43:03 UTC
Excuse me, what are the workarounds for the problem besides patching the drm kernel module? Is it possible to invalidate the pixmap cache that GTK uses?

2nd: the bug is in "NEEDINFO" state. What kind of info is needed?
Comment 28 Daniel Vetter 2014-03-26 22:21:42 UTC
Disabling swap and replacing it with more RAM.
Comment 29 Rodrigo Vivi 2014-10-08 22:29:47 UTC
And by the time, please retest with latest drm-intel-nightly
Comment 30 Chris Wilson 2014-10-09 06:17:40 UTC
(In reply to Rodrigo Vivi from comment #29)
> And by the time, please retest with latest drm-intel-nightly

Known hardware issue that remains unresolved. The patch we want tested is attached to this bug.
Comment 31 Daniel Vetter 2014-11-18 13:43:07 UTC
Ok, I've finally gotten around to polishing Chris' patch and updating the testcase:

http://patchwork.freedesktop.org/patch/37073/

As soon as I have a few tested-by reports I'll pull this in, so please go wild. The patch applies on top of the latest drm-intel-nightly.
Comment 32 Daniel Vetter 2014-11-20 10:27:36 UTC
Workaround is now merged into drm-intel-nightly, should land in 3.19

commit 14a369b6c9bdb40cebdac5a248321a05119fe02b
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Nov 20 09:26:30 2014 +0100

    drm/i915: Pin tiled objects for L-shaped configs

Note that this is v2, v1 was a bit WARNING-happy.
Comment 33 Ander Conselvan de Oliveira 2014-12-15 14:02:58 UTC
(In reply to Daniel Vetter from comment #32)
> Workaround is now merged into drm-intel-nightly, should land in 3.19
> 
> commit 14a369b6c9bdb40cebdac5a248321a05119fe02b
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date:   Thu Nov 20 09:26:30 2014 +0100
> 
>     drm/i915: Pin tiled objects for L-shaped configs
> 
> Note that this is v2, v1 was a bit WARNING-happy.

Assuming the problem was fixed by the patch above, since it has been almost a month. Please reopen if necessary.
Comment 34 Janus Troelsen 2014-12-25 15:56:24 UTC
I have this chip; running version 2:2.99.916+git201412 of the Intel driver from xorg-edgers didn't help. I used the DebugWait trick from bug 37326 and it seemed to help.
Comment 35 Janus Troelsen 2014-12-25 15:57:52 UTC
Created attachment 111328 [details]
Corrupted desktop
Comment 36 Janus Troelsen 2014-12-25 15:59:43 UTC
I have 3145990144 bytes of RAM and I think it may also be bug 55000. But since that bug does not mention the DebugWait workaround, and it works for me, I don't know what to think.
Comment 37 Janus Troelsen 2014-12-25 16:42:46 UTC
I was wrong, DebugWait didn't help. I disabled DebugWait again of course.

It looks as if upgrading to kernel v3.19-rc1 did though.

I got the kernel here: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.19-rc1-vivid/

However, the commit ID of the patch doesn't match the one Daniel Vetter gave above. I found this though, with the same commit description: https://github.com/torvalds/linux/commit/656bfa3afc14e45e2d9e1624bf60d79b3beb12f2

In debugfs, I can see that the new code paths are getting hit:

# cat /sys/kernel/debug/dri/0/i915_swizzle_info
bit6 swizzle for X-tiling = bit9/bit10/bit11
bit6 swizzle for Y-tiling = bit9/bit11
DDC = 0x000f0002
DDC2 = 0x00000000
C0DRB3 = 0x0000
C1DRB3 = 0x0000
L-shaped memory detected
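A small sketch for checking this output from a script follows. It relies only on the line format shown above; the function names are made up. The swizzle_bit6 helper reflects an assumption about what "bit6 swizzle = bit9/bit10/bit11" means, namely that address bit 6 is XORed with the listed address bits. That reading is not stated anywhere in this report.

```python
def parse_swizzle_info(text):
    """Parse i915_swizzle_info text into a dict of 'key = value' pairs.

    The extra key 'l_shaped' is True if the L-shaped marker line is present.
    """
    info = {"l_shaped": False}
    for line in text.splitlines():
        line = line.strip()
        if line == "L-shaped memory detected":
            info["l_shaped"] = True
        elif " = " in line:
            key, value = line.split(" = ", 1)
            info[key] = value
    return info


def swizzle_bits(description):
    """Turn a description like 'bit9/bit10/bit11' into [9, 10, 11]."""
    return [int(part[3:]) for part in description.split("/")
            if part.startswith("bit")]


def swizzle_bit6(address, bits):
    """Flip bit 6 of address by the XOR of the given address bits.

    Assumption: this is how the bit-6 swizzle is applied.
    """
    flip = 0
    for bit in bits:
        flip ^= (address >> bit) & 1
    return address ^ (flip << 6)
```

For example, parse_swizzle_info(open("/sys/kernel/debug/dri/0/i915_swizzle_info").read())["l_shaped"] would tell whether the kernel detected the L-shaped configuration (reading debugfs normally requires root).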
