Created attachment 70082 [details]
dmesg output at time of crash, gzipped
Created attachment 70083 [details]
lspci output showing affected chipset
Created attachment 70084 [details]
Xorg log at time of crash
Just transferring my information from the ML to here.

I saw this just the once recently, on 3.7-rc4+. The system was idle and the screen was blanked. I could get out to a virtual console, but had to reboot to get 3D working again (2D graphics seemed to be working OK).

Nov 11 17:36:02 omega kernel: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Nov 11 17:36:02 omega kernel: [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 00003000 tail 00000000 start 00003000
Nov 11 17:36:03 omega kernel: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Nov 11 17:36:03 omega kernel: [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
Nov 11 17:36:03 omega kernel: [drm:i915_reset] *ERROR* Failed to reset chip.

And it was also gnome-shell which was involved.

Nov 11 17:36:14 omega kernel: gnome-shell[15559]: segfault at 0 ip 00007fa0ac65b695 sp 00007fffddca2cd0 error 4 in i965_dri.so[7fa0ac5f1000+3bf000]

Unfortunately I neglected to get the i915_error_state.

This is also on an x86_64 Fedora 16 system with a G41:

00:02.0 VGA compatible controller: Intel Corporation 4 Series Chipset Integrated Graphics Controller (rev 03)

Now running 3.7-rc5 and not seen this happen again. I did see this thing after a while during Fedora 14 (running kernel.org kernels) but they went away after moving to Fedora 16, until this latest oddity.

Here's the mesa/libdrm/intel driver versions.

mesa-libGL-7.11.2-3.fc16.x86_64
mesa-libGLU-devel-7.11.2-3.fc16.x86_64
mesa-libGL-devel-7.11.2-3.fc16.x86_64
mesa-libGLU-7.11.2-3.fc16.x86_64
mesa-dri-drivers-7.11.2-3.fc16.x86_64
mesa-dri-filesystem-7.11.2-3.fc16.x86_64
libdrm-2.4.33-1.fc16.x86_64
xorg-x11-drv-intel-2.20.8-1.fc16.x86_64

Created attachment 70112 [details] [review]
disable unbound tracking

Silly me just noticed that the unbound tracking has been merged into 3.7, not 3.6. This has a big enough impact to explain all kinds of things. Please try the attached patch, thanks.
Would this explain why the crash happens on my G41 at work but not on my G33 at home? No, they both utilize unbound pages (if you have the same kernel). The only significant difference will be mesa and the use of reloc-trees. Created attachment 70170 [details] [review] disable cpu relocs completely I'm not completely sure, but I think we haven't ruled this one out yet. Please test, thanks Is this second patch supposed to be applied in addition to the first one, or instead of it? For now, I will assume it should be applied on top of the first one. (In reply to comment #10) > Is this second patch supposed to be applied in addition to the first one, or > instead of it? For now, I will assume it should be applied on top of the > first one. Atm we're lacking a bit clue what's going on, so just a bunch of test patches. You can test them all at once, if it works we can figure out which one fixed things. My graphics session just crashed again in the exact same way as before. I was running 3.7-rc5 and it crashed in the middle of compiling 3.7-rc6. Forgot to tell, I had both test patches applied on top of -rc5. Created attachment 70265 [details]
dmesg, second crash, 3.7-rc5, both test patches applied
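For anyone skimming dmesg attachments like this one, the following is a minimal sketch (not part of the original report) of how the `[drm:...] *ERROR*` lines quoted earlier in this thread could be pulled out and counted per reporting function. The function names `drm_errors` and `summarize` are made up for illustration, and the regular expression only assumes the line shape seen in the logs pasted above.

```python
import re

# Matches kernel log lines shaped like the ones quoted in this report, e.g.
#   ... kernel: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
DRM_ERROR = re.compile(r"\[drm:(?P<func>\w+)\]\s+\*ERROR\*\s+(?P<msg>.*)")


def drm_errors(lines):
    """Yield (function, message) for every '[drm:func] *ERROR* message' line."""
    for line in lines:
        match = DRM_ERROR.search(line)
        if match:
            yield match.group("func"), match.group("msg").strip()


def summarize(lines):
    """Count how many error lines each drm function reported."""
    counts = {}
    for func, _msg in drm_errors(lines):
        counts[func] = counts.get(func, 0) + 1
    return counts
```

Feeding it the lines of an uncompressed dmesg file (for example `summarize(open("dmesg.txt"))`, where the file name is only a placeholder) gives a quick way to see whether two crashes produced the same sequence of hangcheck and reset errors.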
Created attachment 70267 [details]
i915_error_state, second crash, 3.7-rc5, both test patches applied
Ok, yet another new theory ... please attach your kernel .config, thanks. Created attachment 70269 [details]
Kernel .config used to build failing kernels
A possibly helpful thing: in my home machine (where the crash does not occur) a sample run of glxgears opens /usr/lib64/dri/i915_dri.so . In my work machine (where the crash occurs), glxgears opens /usr/lib64/dri/i965_dri.so . Created attachment 70276 [details]
3.7-rc .config
I saw this again on Sunday with 3.7-rc6. Attached my .config and for what it's worth, my machine here is also loading i965_dri.so for glxgears
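As an aside, here is a small sketch of one way to check which DRI driver a running process has mapped, instead of watching what glxgears opens. It assumes the usual Linux `/proc/<pid>/maps` layout, where the mapped file path is the last field on each line; the helper names are hypothetical and not taken from any tool mentioned in this report.

```python
import re

# The mapped file path is the last field of a /proc/<pid>/maps line, e.g.
#   7fa0ac5f1000-7fa0ac9b0000 r-xp 00000000 08:01 1234 /usr/lib64/dri/i965_dri.so
DRI_LIB = re.compile(r"/dri/(\w+_dri\.so)\s*$")


def dri_drivers_in_maps(maps_text):
    """Return the set of *_dri.so file names found in /proc/<pid>/maps text."""
    found = set()
    for line in maps_text.splitlines():
        match = DRI_LIB.search(line)
        if match:
            found.add(match.group(1))
    return found


def dri_drivers_for_pid(pid):
    """Read /proc/<pid>/maps for a live process and list its DRI drivers."""
    with open(f"/proc/{pid}/maps") as maps:
        return dri_drivers_in_maps(maps.read())
```

On the machines described here one would expect `i965_dri.so` for the G41 systems and `i915_dri.so` for the home machine, matching what was observed with glxgears.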
Alex, do you mind giving the tree at http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=for-imre a spin:

$ cd /path/to/linux
$ git remote add ickle -f git://people.freedesktop.org/~ickle/linux-2.6
$ git checkout ickle/for-imre
make; install; test

Ok, will do.

For reference:

[alex@avillacis linux-git]$ git checkout ickle/for-imre
Checking out files: 100% (1025/1025), done.
Note: checking out 'ickle/for-imre'.

You are in 'detached HEAD' state. You can look around, make experimental changes and commit them, and you can discard any commits you make in this state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b new_branch_name

HEAD is now at ab5c8df... drm/i915: Preallocate next seqno before touching the ring

I was going to suggest naming it bug57122 (git checkout -b bug57122 ickle/for-imre), but that branch is going to be pretty volatile and you only want it for smoketesting...

I am now running the requested branch. However, I am again experiencing the "random stalls in graphics applications" issue that I reported in the mailing list back when I tested master 3.7-rc2 . Here is what I reported at that time:

--------start quote--------
I am testing linux-3.7-rc2 in Fedora 16 x86_64 in a workstation at my day job. My kernel configuration is attached. 
My graphics chipset shows up in lspci as follows: 00:02.0 VGA compatible controller [0300]: Intel Corporation 4 Series Chipset Integrated Graphics Controller [8086:2e32] (rev 03) (prog-if 00 [VGA controller]) Subsystem: Intel Corporation Device [8086:d612] Flags: bus master, fast devsel, latency 0, IRQ 43 Memory at d0000000 (64-bit, non-prefetchable) [size=4M] Memory at c0000000 (64-bit, prefetchable) [size=256M] I/O ports at f140 [size=8] Expansion ROM at <unassigned> [disabled] Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit- Capabilities: [d0] Power Management version 2 Kernel driver in use: i915 Kernel modules: i915 All the tests were made with the default KMS enabled, as setup by the distro. With the distro-supplied linux-3.6.2-1, I have no problems at all with graphics. Likewise with vanilla kernel 3.6.0. With 3.7-rc1 onwards, and also 3.7-rc2, my workstation seems to boot normally and I can login into Gnome Shell. However, after a while, some random graphical client with which I am interacting stops responding. This stall is of random length - sometimes it lasts a fraction of a second, or any interval up to a few minutes. It seems that anything that causes the app to try to draw to the screen might cause the stall, even something as simple as switching to the app. So far, I have seen stalls while using the following apps: firefox, thunderbird, eclipse, gnome-terminal, and even gnome-shell itself. When gnome-shell is affected, the entire desktop freezes, and becomes unusable. However, in all cases (even gnome-shell stalls), the mouse cursor can be moved, and I can switch into other consoles with Ctrl-Alt-Fn very easily. When I switch to a console, I can run top, and it shows me that the stalled application is apparently burning CPU time at 99%, but the corresponding CPU is busy in "system" time, not "user" time. Interestingly, the xserver process itself has never been seen burning CPU when these stalls happen. 
I have tried killing the stalled gnome-shell with "kill -9 PID", but it proved unkillable even by this. However I then restarted the X server with Ctrl-Alt-Backspace, and this managed to terminate the same unkillable gnome-shell. I have also attached gdb to the stalled processes. However, the symbol loading goes by at a snails pace, which is unusual. After that I managed to issue bt on two processes. The results are attached. In both backtraces, the innermost function is writev initiated by XPutImage. I have seen nothing unusual for me in the dmesg output (attached) even with a stalled process running. It seems that this problem is unique to my workstation. My home machine has a different Intel chipset (G31 if I remember correctly) but also runs Fedora 16 x86_64 with 3.7-rc2, and has never been affected by this issue. --------end quote-------- Back when I was testing master 3.7-rc2, I was asked to post /proc/PID/stack for the affected process. This is what I get: [alex@avillacis ~]$ cat /proc/2535/stack [<ffffffffffffffff>] 0xffffffffffffffff In the test kernel, sometimes I catch the stalling process with this stack: [<ffffffff816422e6>] retint_kernel+0x26/0x30 [<ffffffff81149feb>] shrink_page_list+0x68b/0xa00 [<ffffffff8114a8cf>] shrink_inactive_list+0x18f/0x450 [<ffffffff8114b318>] shrink_lruvec+0x448/0x560 [<ffffffff8114b4a5>] shrink_zone+0x75/0xa0 [<ffffffff8114b63b>] zone_reclaim+0x16b/0x270 [<ffffffff8113f6c1>] get_page_from_freelist+0x511/0x740 [<ffffffff8113fa98>] __alloc_pages_nodemask+0x1a8/0x9f0 [<ffffffff8117e6a3>] alloc_pages_vma+0xb3/0x190 [<ffffffff81160e39>] handle_pte_fault+0x709/0xab0 [<ffffffff81162469>] handle_mm_fault+0x269/0x340 [<ffffffff8164540c>] __do_page_fault+0x16c/0x5a0 [<ffffffff8164584e>] do_page_fault+0xe/0x10 [<ffffffff81642488>] page_fault+0x28/0x30 [<ffffffffffffffff>] 0xffffffffffffffff (In reply to comment #25) > In the test kernel, sometimes I catch the stalling process with this stack: > > [<ffffffff816422e6>] 
retint_kernel+0x26/0x30 > [<ffffffff81149feb>] shrink_page_list+0x68b/0xa00 > [<ffffffff8114a8cf>] shrink_inactive_list+0x18f/0x450 > [<ffffffff8114b318>] shrink_lruvec+0x448/0x560 > [<ffffffff8114b4a5>] shrink_zone+0x75/0xa0 > [<ffffffff8114b63b>] zone_reclaim+0x16b/0x270 > [<ffffffff8113f6c1>] get_page_from_freelist+0x511/0x740 > [<ffffffff8113fa98>] __alloc_pages_nodemask+0x1a8/0x9f0 > [<ffffffff8117e6a3>] alloc_pages_vma+0xb3/0x190 > [<ffffffff81160e39>] handle_pte_fault+0x709/0xab0 > [<ffffffff81162469>] handle_mm_fault+0x269/0x340 > [<ffffffff8164540c>] __do_page_fault+0x16c/0x5a0 > [<ffffffff8164584e>] do_page_fault+0xe/0x10 > [<ffffffff81642488>] page_fault+0x28/0x30 > [<ffffffffffffffff>] 0xffffffffffffffff There's a direct-reclaim bug that matches this description in that kernel that is yet to be resolved upstream. Can you please keep testing and comparing the stacks of when it is stalled (or try sudo perf top) and see if it is always the same (or at least similar)? The thing is, this random-stall situation disappeared under 3.7-rc3. Hmmm... the same kernel when I first noticed the graphics crash. Sorry, I cannot keep testing ickle/for-imre because the frequent stalls make the session essentially unusable. Please remember this is my day job machine. So linus/master seems better behaved, so I pushed the merged branch to ickle/for-imre. Alex, if you feel brave... Thanks. After an hour of testing the merge of ickle/for-imre, the graphics session froze, but this time it did not segfault gnome-shell. The screen just froze when I was doing the gesture of moving the mouse pointer to the top-left corner in order to dismiss the gnome-shell window expose. Through a remote ssh connection, I was able to collect some debugging information. Created attachment 70387 [details]
i915_error_state, freeze with ickle/for-imre
Created attachment 70388 [details]
dmesg, freeze with ickle/for-imre
Created attachment 70389 [details]
Xorg log, freeze with ickle/for-imre
Forgot to mention. The merged ickle/for-imre did not exhibit any random stalls at all before the freeze. Well the good news is that is a completely different bug. Again it should be impossible... I've put a smaller selection of patches in ickle/bug55984. It's still a shotgun approach, but a good first step will be to see if it cures the hang.. The ickle/bug55984 branch still hangs on me after about one hour of use. I will post the debugging files again for this hang. Created attachment 70484 [details] i915_error_state, freeze with ickle/bug55984 Created attachment 70485 [details] dmesg, freeze with ickle/bug55984 Created attachment 70486 [details] Xorg log, freeze with ickle/bug55984 Created attachment 70578 [details] [review] Don't force GTT/CPU relocations Today's patch, please test. Unable to apply cleanly: [alex@avillacis linux-ickle-bug55984]$ patch -p1 --dry-run < ../0001-drm-i915-Avoid-forcing-relocations-through-the-mappa.patch patching file drivers/gpu/drm/i915/i915_gem_execbuffer.c Hunk #1 succeeded at 37 with fuzz 2 (offset 4 lines). Hunk #2 FAILED at 98. Hunk #3 FAILED at 205. Hunk #4 FAILED at 231. Hunk #5 FAILED at 335. Hunk #6 FAILED at 352. Hunk #7 FAILED at 424. Hunk #8 FAILED at 435. Hunk #9 FAILED at 467. Hunk #10 FAILED at 476. Hunk #11 FAILED at 659. 10 out of 11 hunks FAILED -- saving rejects to file drivers/gpu/drm/i915/i915_gem_execbuffer.c.rej I am trying to apply this on top of ickle/bug55984 . Sorry, I wasn't clear in my intentions. Can you please apply this to 3.7-rc7 or drm-intel-fixes? Ok. Applied on top of vanilla 3.7-rc7 with previous test patches removed. (In reply to comment #44) > Ok. Applied on top of vanilla 3.7-rc7 with previous test patches removed. Sorry for late response. I suddenly realized that I had VirtualBox running. So I had to remove it as a possible factor on the crash. 
I tested 3.7-rc7 *without* "Don't force GTT/CPU relocations" patch and with KVM/QEMU instead of VirtualBox for my virtual machines, and it ran correctly until today, in which the graphics display hung in the exact same way as last time. Now I will test the patch, again without VirtualBox. Created attachment 70976 [details]
dmesg, freeze with dont-force-gpu-relocations patch
Once again, my session crashed with the dont-force-gpu-relocations patch applied. I was running a KVM/QEMU virtual machine, and scrolling down on a page in firefox. Furthermore, I was unable to capture the i915_error_state file, because both cat and cp complained "out of memory" when trying to read the report. There are some backtraces on the dmesg output. Do they give some clue?
One thing that I have asked elsewhere on a similar bug is to see if you can reproduce the failure with SNA and attach that error-state. If it does reoccur, due to the different layout of the batchbuffer, we can get a fair amount of auxiliary data which may yield a clue.

Sorry, I am not familiar at all with "SNA". What is it? What do I have to do to use it?

Add

Section "Device"
  Identifier "Device0"
  Driver "intel"
  Option "AccelMethod" "sna"
EndSection

to your xorg.conf (or as a snippet in xorg.conf.d).

I have run 3.7.0-rc8 with SNA until today without being able to trigger the graphics crash. I am being careful to run QEMU/KVM instead of VirtualBox in order to avoid introducing taint in the kernel. However, I notice some graphic artifacts, such as the middle button of the windows (the one that maximises the window) being blank with the background color, but drawing itself correctly when I hover the mouse over it. I will now test just-released 3.7.0 *without* SNA to see if there is any change.

(In reply to comment #50)
> However, I notice some
> graphic artifacts, such as the middle button of the windows (the one that
> maximises the window) being blank with the background color, but drawing
> itself correctly when I hover the mouse over it.

The GPU is buggy and I'm trying to find a workaround that doesn't kill performance...

(In reply to comment #50)
> I have run 3.7.0-rc8 with SNA until today without being able to trigger the
> graphics crash. I am being careful to run QEMU/KVM instead of VirtualBox in
> order to avoid introducing taint in the kernel. However, I notice some
> graphic artifacts, such as the middle button of the windows (the one that
> maximises the window) being blank with the background color, but drawing
> itself correctly when I hover the mouse over it. I will now test
> just-released 3.7.0 *without* SNA to see if there is any change.

Unfortunately 3.7.0 still exhibits the same graphics crash without SNA - it crashed after a few hours of use. 
I am now using SNA just so that I have a stable system. Created attachment 71440 [details] [review] Keep reserved objects pinned until after reloction processing. An idea. It should be impossible... (In reply to comment #53) > Created attachment 71440 [details] [review] [review] > Keep reserved objects pinned until after reloction processing. > > An idea. It should be impossible... Applied on top of 3.7. Will start testing shortly, without SNA. Bad luck. The crash still happens after applying the patch. Again, I was unable to capture i915_error_state due to "out of memory" errors. BTW, the issue of graphic artifacts when using SNA under 3.7 should be considered a regression. The 3.6.7-4.fc16.x86_64 distro-supplied kernel does not exhibit said artifacts with SNA. (In reply to comment #56) > BTW, the issue of graphic artifacts when using SNA under 3.7 should be > considered a regression. The 3.6.7-4.fc16.x86_64 distro-supplied kernel does > not exhibit said artifacts with SNA. Ok, then it is not the artifacts I'm aware of (I guess). Can you please try to grab photo or screenshot? Created attachment 71682 [details]
Screenshot showing artifact
This is the artifact I am seeing most frequently with SNA (all the time, all the windows). The middle button of the window decoration (gnome-shell) is supposed to show the maximize icon, but instead is blank. This does not occur with the 3.6.x kernel, or with the default (crashing) UXA acceleration.
Is there any monitoring I could perform in the background under 3.7 that will provide information on the root cause of the crash, *before* said crash happens? (In reply to comment #58) > Created attachment 71682 [details] > Screenshot showing artifact > > This is the artifact I am seen most frequently with SNA (all the time, all > the windows). The middle button of the window decoration (gnome-shell) is > supposed to show the maximize icon, but instead is blank. This does not > occur with the 3.6.x kernel, or with the default (crashing) UXA acceleration. I could have sworn that was the CompositeTrapezoids Damage bug (lack of Damage notification sent). But that should also be the case with 3.6. It looks like it should be a small inline trapezoid, in which case it will be upload through an async buffer (either snooped or GTT depending upon state and kernel.) Daniel, can you remember when set_cacheing finally landed? That might indeed be 3.7. (In reply to comment #59) > Is there any monitoring I could perform in the background under 3.7 that > will provide information on the root cause of the crash, *before* said crash > happens? Well, you've debunked my best ideas so far. I'm convinced that the key difference is in the relocation-*tree* used by UXA. But not yet sure how the bug is manifesting itself. In that scenario the most likely culprit is that we reuse stale relocation entries believing that they are valid. Perhaps if we always forced the relocations? Created attachment 71702 [details]
dmesg with 3.7.0, CONFIG_PROVE_LOCKING=y
In an attempt to get some information, I recompiled the kernel with CONFIG_PROVE_LOCKING=y, and added slub_debug=FZPU to the kernel command line. Then, I set the acceleration back to UXA, and left a slabinfo -v running every 5 seconds in the background. After that I started a KVM virtual machine, and some time after that, I got a lock ordering warning in the attached dmesg. Does this shed some light on the graphics issue, or is this a completely separate bug? If not related where should I report this?
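In the same spirit as the background slabinfo loop described above, here is a rough sketch of a poller that saves a copy of i915_error_state as soon as the driver records one. It cannot reveal anything before the hang happens; it only tries to grab the state early. The debugfs path and the "no error state collected" placeholder text are assumptions about this kernel generation (the dmesg output elsewhere in this report refers to /debug/dri/0/i915_error_state), the function names are made up, and reading debugfs normally requires root.

```python
import shutil
import time
from pathlib import Path

# Assumed debugfs location; the mount point varies between systems.
ERROR_STATE = Path("/sys/kernel/debug/dri/0/i915_error_state")
# Assumed text the file contains while no hang has been recorded.
EMPTY_MARKER = b"no error state collected"


def has_error_state(path, probe_bytes=4096):
    """Return True if the file starts with something other than the empty marker."""
    with open(path, "rb") as state:
        head = state.read(probe_bytes)
    return bool(head.strip()) and not head.lstrip().startswith(EMPTY_MARKER)


def save_copy(path, dest_dir, chunk=64 * 1024):
    """Copy the error state into dest_dir under a timestamped name."""
    dest = Path(dest_dir) / f"i915_error_state.{int(time.time())}"
    with open(path, "rb") as src, open(dest, "wb") as out:
        shutil.copyfileobj(src, out, chunk)
    return dest


def watch(path=ERROR_STATE, dest_dir=".", interval=5.0):
    """Poll until an error state appears, save it once, and return the copy's path."""
    while not has_error_state(path):
        time.sleep(interval)
    return save_copy(path, dest_dir)
```

Whether this would avoid the "out of memory" failures seen with cat and cp in the earlier comments is unknown; the copy is simply done in fixed-size chunks.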
Locksplat of zcache vs. pagecache afaict. I'd suggest to send that thing to the linux-kernel mailing list, cc fs-devel directly. Shouldn't be related to the gfx issue at hand here. Please try out the patch at https://patchwork.kernel.org/patch/1885411/ It has a decent chance to reduce gtt trashing, which might be good enough to again ducttape over the hangs. Or maybe change the pattern to be able to reproduce it much quicker. In any case, should be interesting ... I tried the patch at https://patchwork.kernel.org/patch/1885411/ . After a few hours of use, the system failed, but in a different way. All of a sudden, the graphical desktop became unresponsive. No mouse movement, no keyboard response, keyboard leds could not be toggled, Ctrl-Alt-Backspace did not work. I sshd into the machine, and "top" showed the Xorg process at 99% system time in one CPU. All attempts to kill Xorg failed, even with kill -9. All attempts to attach to the process with gdb hung. A controlled reboot via ssh also hung, so I had to hard-reset the machine. No error state was collected in i915_error_state, and there was no DRI-related backtrace in the error log. (In reply to comment #64) > I tried the patch at https://patchwork.kernel.org/patch/1885411/ . After a > few hours of use, the system failed, but in a different way. All of a > sudden, the graphical desktop became unresponsive. No mouse movement, no > keyboard response, keyboard leds could not be toggled, Ctrl-Alt-Backspace > did not work. I sshd into the machine, and "top" showed the Xorg process at > 99% system time in one CPU. All attempts to kill Xorg failed, even with kill > -9. All attempts to attach to the process with gdb hung. A controlled reboot > via ssh also hung, so I had to hard-reset the machine. No error state was > collected in i915_error_state, and there was no DRI-related backtrace in the > error log. BTW, this was with UXA acceleration, not SNA. 
(In reply to comment #64) > I tried the patch at https://patchwork.kernel.org/patch/1885411/ [snip] >I sshd into the machine, and "top" showed the Xorg process at > 99% system time in one CPU. Daniel, that's the bug I thought was elsewhere. Basically we evict something to make room, but then fail to find a hole. I thought it was my create top-down that was broken, but there be dragons. Created attachment 71780 [details]
Debian-generated software info
FWIW, I'm seeing the same symptoms (gnome-shell hangs, mouse movable, killall -9 gnome-shell revives) with 2.20.14 on
Linux ding 3.5-trunk-amd64 #1 SMP Debian 3.5.5-1~experimental.1 x86_64 GNU/Linux
with Mesa 8.0.5. I believe this started when I upgraded libdrm, the intel DDX and Mesa a while back. Kernel stayed constant.
Created attachment 71806 [details] [review]
make the shrinker less aggressive

Duct-tape solution if it is one, but imo very much worth a try.

(In reply to comment #68)
> Created attachment 71806 [details] [review] [review]
> make the shrinker less aggressive
>
> Duct-tape solution if it is one, but imo very much worth a try.

Applying on top of vanilla 3.7.0 and https://patchwork.kernel.org/patch/1885411/ .

No luck. Both patches together still result in Xorg spinning and eating all the CPU in system mode after two hours of normal use. I had to hard-reset the machine again.

(In reply to comment #70)
> No luck. Both patches together still result in Xorg spinning and eating all
> the CPU in system mode after two hours of normal use. I had to hard-reset
> the machine again.

Please test again only with the "make shrinker less aggressive" patch, the former patch seems to be broken somehow and my patch doesn't try to fix that. So the same "X stuck spinning" symptoms are still expected.

Created attachment 71932 [details] [review]
Align surface sizes to an even tile row

(In reply to comment #71)
> (In reply to comment #70)
> > No luck. Both patches together still result in Xorg spinning and eating all
> > the CPU in system mode after two hours of normal use. I had to hard-reset
> > the machine again.
>
> Please test again only with the "make shrinker less aggressive" patch, the
> former patch seems to be broken somehow and my patch doesn't try to fix
> that. So the same "X stuck spinning" symptoms are still expected.

Running 3.7.0 with "make shrinker less aggressive" patch only. So far, two days without graphics issues. Seems good, but I will keep testing. In a prior test, the machine lasted a week before a graphics crash. 
xf86-video-intel commit 736b89504a32239a0c7dfb5961c1b8292dd744bd
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Sun Dec 30 10:32:18 2012 +0000

    uxa: Align surface allocations to even tile rows

    Align surface sizes to an even number of tile rows to cater for sampler prefetch. If we read beyond the last page we may catch the PTE in a state of flux and trigger a GPU hang. Also detected by enabling invalid PTE access checking.

    References: https://bugs.freedesktop.org/show_bug.cgi?id=56916
    References: https://bugs.freedesktop.org/show_bug.cgi?id=55984
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

My machine is still running stable after one more day with the less-aggressive-shrinker kernel patch. Meanwhile, 3.8-rc2 is out. Should I try to upgrade to this kernel version? Should I apply the less-aggressive-shrinker kernel patch on this kernel too?

BTW, I am not currently testing any of the xf86-video-intel patches. I am still using the distro-supplied version xorg-x11-drv-intel-2.20.8-1.fc16.x86_64 .

(In reply to comment #75)
> My machine is still running stable after one more day with the
> less-aggressive-shrinker kernel patch. Meanwhile, 3.8-rc2 is out. Should I
> try to upgrade to this kernel version? Should I apply the
> less-aggressive-shrinker kernel patch on this kernel too?

There is no patch for this issue upstream yet. And the less-aggressive-shrinker is not the forerunner of potential workaround patches. (The fixed version of #63 is a better choice since it actually fixes a real bug and has a side-effect of also reducing the likelihood of triggering this hang.)

> BTW, I am not currently testing any of the xf86-video-intel patches. I am
> still using the distro-supplied version
> xorg-x11-drv-intel-2.20.8-1.fc16.x86_64 .

This is most likely to be the root cause of the problem.

So I should be trying an updated xorg-intel driver on top of an *unpatched* kernel? 
I thought that, since the distro-supplied kernel works fine with the distro-supplied xorg-intel driver, and the failure occurs only if I swap the kernel, it must therefore be a kernel bug. I have successfully compiled xf86-video-intel at fc702cdf534a4694a64408428e8933497a7fc06e and it appears to run correctly under patched 3.7.0 kernel. I will now compile unpatched 3.8-rc2 and see what happens. Bad luck. The unpatched 3.8-rc2 kernel crashed on me just a moment ago, even with the updated xorg-intel driver. Same symptoms as before. Thanks, that is useful to know. Still at a loss to explain this, except that we know it has to do with surface evictions and the processing of the relocation tree. Everyone please retest with latest drm-intel-fixes from http://cgit.freedesktop.org/~danvet/drm-intel I've just merged a bunch of duct-tapes for this issue. Been testing 3.8.0-rc3-00074-gb719f43 under an up to date 64bit Fedora 16, with a gen4 G41 00:02.0 VGA compatible controller: Intel Corporation 4 Series Chipset Integrated Graphics Controller (rev 03) So far the bug seems to be sufficiently hidden again ;) However at some point I have had this. [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state i915: render error detected, EIR: 0x00000010 i915: IPEIR: 0x00000000 i915: IPEHR: 0x01000000 i915: INSTDONE_0: 0xfffffffe i915: INSTDONE_1: 0xffffffff i915: INSTDONE_2: 0x00000000 i915: INSTDONE_3: 0x00000000 i915: INSTPS: 0x0001e000 i915: ACTHD: 0xd2c08eb8 i915: page table error i915: PGTBL_ER: 0x00000002 [drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking Though I've not noticed any ill effects yet. Cheers, Andrew Created attachment 72905 [details]
i915 error state from a non hung error state
Attached the i915_error_state to go along with my previous comment in case it's useful
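To make register dumps like the one in the previous comment easier to compare between reports, here is a minimal sketch that extracts the `NAME: 0xVALUE` pairs from such dmesg text into a dictionary. The function name is hypothetical, the pattern only assumes the upper-case register labels seen in the quoted output, and no attempt is made to decode what individual bits mean.

```python
import re

# Upper-case register label, a colon, then a hex value, as in "EIR: 0x00000010".
REGISTER = re.compile(r"\b(?P<name>[A-Z][A-Z0-9_]*):\s+(?P<value>0x[0-9a-fA-F]+)")


def parse_i915_registers(text):
    """Map each register label found in the text to its integer value."""
    registers = {}
    for match in REGISTER.finditer(text):
        registers[match.group("name")] = int(match.group("value"), 16)
    return registers
```

Lines such as "EIR stuck: 0x00000010, masking" are deliberately not matched, since the label there is followed by extra words rather than a colon.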
Consolidating all gen4/5 i/o related hangs.

*** This bug has been marked as a duplicate of bug 55984 ***

A patch referencing this bug report has been merged in Linux v3.8-rc4:

commit 93927ca52a55c23e0a6a305e7e9082e8411ac9fa
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Thu Jan 10 18:03:00 2013 +0100

    drm/i915: Revert shrinker changes from "Track unbound pages"

Patch merged, closing.
Created attachment 70081 [details]
i915_error_state at time of crash, gzipped

System is Fedora 16 x86_64, xorg-x11-server-Xorg-1.11.4-3.fc16.x86_64, xorg-x11-drv-intel-2.20.8-1.fc16.x86_64, gnome-shell-3.2.2.1-1.fc16.x86_64.

With distro-supplied kernel (kernel-3.6.6-1.fc16.x86_64), system works correctly. Also works correctly with self-compiled vanilla 3.6.0 kernel.

Since vanilla kernel 3.7-rc3 up to current 3.7-rc5, I have been experiencing random crashes of the graphic session. The affected process is always gnome-shell. I currently have no known way to induce the crash. The crash occurs randomly - it might happen a few minutes into the session, or it might not happen at all until I turn off the computer. Therefore it is hard for me to perform a bisection.

The crash only happens with my work computer (Intel G41 chipset). My home computer runs 3.7-rc5 x86_64 in the same Fedora 16 setup without incidents, but it is an Intel G31 as far as I can remember.