Created attachment 112259 [details]
dmesg

==System Environment==
--------------------------
Regression: not sure

Non-working platforms: BDW

==kernel==
--------------------------
drm-intel-nightly/9c4bdce37d09c0682f04bb5e6d0567def5c8d786
commit 9c4bdce37d09c0682f04bb5e6d0567def5c8d786
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Tue Jan 13 23:27:51 2015 +0100

    drm-intel-nightly: 2015y-01m-13d-22h-27m-23s UTC integration manifest

==Bug detailed description==
-----------------------------
It sporadically causes system hang, fail rate: 1/5.

output:
IGT-Version: 1.9-g5fb26d1 (x86_64) (Linux: 3.19.0-rc4_drm-intel-nightly_9c4bdc_20150114+ x86_64)

dmesg:
[ 499.462482] BUG: unable to handle kernel paging request at 0000000100000088
[ 499.462532] IP: [<ffffffffa01093d9>] capture_bo+0x4/0x14d [i915]
[ 499.462583] PGD a1cdd067 PUD 0
[ 499.462610] Oops: 0000 [#1] SMP
[ 499.462636] Modules linked in: netconsole configfs ipv6 iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi ppdev dm_mod pcspkr i2c_i801 snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer lpc_ich mfd_core snd soundcore battery parport_pc parport ac acpi_cpufreq i915 button video drm_kms_helper drm cfbfillrect cfbimgblt cfbcopyarea [last unloaded: netconsole]
[ 499.462917] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.19.0-rc4_drm-intel-nightly_9c4bdc_20150114+ #410
[ 499.462971] task: ffff880149bb1800 ti: ffff880149bc0000 task.ti: ffff880149bc0000
[ 499.463016] RIP: 0010:[<ffffffffa01093d9>] [<ffffffffa01093d9>] capture_bo+0x4/0x14d [i915]
[ 499.463075] RSP: 0018:ffff88014ec43ce0 EFLAGS: 00010203
[ 499.463106] RAX: 0000000100000000 RBX: ffff8800a1de4000 RCX: ffff8800956817e0
[ 499.463151] RDX: ffff880095698020 RSI: ffff880143c8fa80 RDI: ffff8800956817c0
[ 499.463191] RBP: ffff880143c8fa00 R08: ffff880143c8fa80 R09: ffff8800a7d5dd08
[ 499.463232] R10: 0000000000000c01 R11: ffffea000255d820 R12: ffff880144610000
[ 499.463273] R13: 0000000000000008 R14: 0000000000000002 R15: 00000000000000bf
[ 499.463315] FS: 0000000000000000(0000) GS:ffff88014ec40000(0000) knlGS:0000000000000000
[ 499.463361] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 499.463395] CR2: 0000000100000088 CR3: 00000000a1cdc000 CR4: 00000000003407e0
[ 499.463437] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 499.463484] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 499.463531] Stack:
[ 499.463548] ffffffffa00ac483 0000000000000300 ffff880143c8fa80 ffff8800a7d5dd08
[ 499.463614] ffff880095698020 ffff880095680000 ffff880100000c01 ffff880100000005
[ 499.463684] ffff880143c8fae0 ffff880144617b30 ffff880144436000 ffff8800956817e0
[ 499.463742] Call Trace:
[ 499.463760] <IRQ>
[ 499.463778] [<ffffffffa00ac483>] ? i915_capture_error_state+0x681/0x1382 [i915]
[ 499.463856] [<ffffffffa00b4460>] ? i915_handle_error+0x7a/0x599 [i915]
[ 499.463906] [<ffffffffa00b4cd0>] ? i915_hangcheck_elapsed+0x305/0x399 [i915]
[ 499.465609] [<ffffffffa00b49cb>] ? i915_queue_hangcheck+0x4c/0x4c [i915]
[ 499.467303] [<ffffffff8107bf89>] ? call_timer_fn+0x46/0xe2
[ 499.469006] [<ffffffffa00b49cb>] ? i915_queue_hangcheck+0x4c/0x4c [i915]
[ 499.470703] [<ffffffff8107c3ce>] ? run_timer_softirq+0x1af/0x212
[ 499.472389] [<ffffffff8103eb14>] ? __do_softirq+0xdc/0x22f
[ 499.474077] [<ffffffff8103ed9b>] ? irq_exit+0x34/0x78
[ 499.475761] [<ffffffff81026b8e>] ? smp_apic_timer_interrupt+0x39/0x43
[ 499.477451] [<ffffffff8179fdaa>] ? apic_timer_interrupt+0x6a/0x70
[ 499.479151] <EOI>
[ 499.479167] [<ffffffff81009f18>] ? default_idle+0x34/0x8e
[ 499.482553] [<ffffffff8106554c>] ? cpu_startup_entry+0x170/0x2e0
[ 499.484261] Code: c7 3d 33 12 a0 e8 3d a0 f0 ff b8 fb ff ff ff eb 09 b8 00 00 00 00 41 0f 4e c4 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e c3 48 8b 46 48 <48> 8b 90 88 00 00 00 89 17 8b 90 90 00 00 00 89 57 04 48 8b 90
[ 499.486263] RIP [<ffffffffa01093d9>] capture_bo+0x4/0x14d [i915]
[ 499.488079] RSP <ffff88014ec43ce0>
[ 499.489862] CR2: 0000000100000088
[ 499.491650] ---[ end trace 04d70c09331bf34f ]---

==Reproduce steps==
----------------------------
1. ./gem_evict_everything --run-subtest swapping-hang
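A note on reading the oops, offered as a sketch rather than a diagnosis: the bytes around the marked position in the Code: line, `48 8b 46 48 <48> 8b 90 88 00 00 00`, decode to `mov 0x48(%rsi),%rax` followed by `mov 0x88(%rax),%rdx`. The faulting load therefore reads from RAX + 0x88, and with RAX = 0000000100000000 that gives exactly the CR2 value in the dump. RAX does not look like the ffff88... pointers in the other registers, which would be consistent with capture_bo() having loaded a stale or overwritten pointer; which structure field sits at offset 0x48 is not stated anywhere in this report. The helper names below are illustrative only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Values copied verbatim from the register dump in the report. */
#define OOPS_RAX 0x0000000100000000ULL /* RAX at the time of the fault */
#define OOPS_CR2 0x0000000100000088ULL /* CR2: the address that faulted */

/* Displacement taken from the faulting instruction bytes
 * "48 8b 90 88 00 00 00", i.e. mov 0x88(%rax),%rdx. */
#define FAULT_DISP 0x88ULL

/* Effective address of a base + displacement memory operand. */
uint64_t effective_address(uint64_t base, uint64_t disp)
{
    return base + disp;
}

/* Print the comparison; the assert would abort if the dump were inconsistent. */
void print_fault_summary(void)
{
    uint64_t addr = effective_address(OOPS_RAX, FAULT_DISP);

    assert(addr == OOPS_CR2);
    printf("RAX + 0x%llx = 0x%016llx (CR2 = 0x%016llx)\n",
           (unsigned long long)FAULT_DISP,
           (unsigned long long)addr,
           (unsigned long long)OOPS_CR2);
}
```

Calling print_fault_summary() from a small main() shows the computed address next to CR2.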
Adding BSW to this bug.
Chris, you removed the platform 'BDW'. Did you mean that this issue is not tied to a specific platform?
It's a bug in the capture code that is not specific to any architecture.
(In reply to Chris Wilson from comment #3)
> It's a bug in the capture code that is not specific to any architecture.

(The issue is magnified by the partial seqno/request conversion.)
*** Bug 88821 has been marked as a duplicate of this bug. ***
*** Bug 89441 has been marked as a duplicate of this bug. ***
Does the development team agree that this is the highest priority? If so, can we move forward?
Lost track here ... have we merged the patches, Chris?
No. It is something that I addressed in the conversion to requests, but it has been overlooked.
Reducing bug priority after a discussion with Chris. Main points are:
- the bug is not a regression; it has been in the code base since the introduction of lockless error capture;
- there is no user sighting of the bug;
- the blocked test case (gem_evict_everything/swapping-hang) tests for an extreme corner case.

Also, according to Chris, for the solution "we need a couple of spinlocks to serialize bo retirement vs error capture, but we need to avoid creating deadlocks, and that is the tricky part."
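To illustrate the kind of serialization Chris describes, here is a minimal userspace sketch. It is not the i915 code and not a proposed patch: the `sketch_*` names and structures are hypothetical, and a pthread mutex stands in for a kernel spinlock. The point is only that retirement unpublishes the object under the lock before freeing it, and capture copies what it needs while holding the same lock, so capture never dereferences a freed object.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a buffer object; not the i915 structure. */
struct sketch_bo {
    uint64_t gtt_offset;
    uint32_t size;
};

struct sketch_dev {
    pthread_mutex_t lock;     /* userspace stand-in for a kernel spinlock */
    struct sketch_bo *active; /* NULL once the object has been retired */
};

/* Set up one active object. Returns 0 on success, -1 on failure. */
int sketch_dev_init(struct sketch_dev *dev, uint64_t gtt_offset, uint32_t size)
{
    dev->active = malloc(sizeof(*dev->active));
    if (!dev->active)
        return -1;
    dev->active->gtt_offset = gtt_offset;
    dev->active->size = size;
    if (pthread_mutex_init(&dev->lock, NULL) != 0) {
        free(dev->active);
        dev->active = NULL;
        return -1;
    }
    return 0;
}

/* Retirement: unpublish under the lock, free only after dropping it. */
void sketch_retire_bo(struct sketch_dev *dev)
{
    struct sketch_bo *bo;

    pthread_mutex_lock(&dev->lock);
    bo = dev->active;
    dev->active = NULL;
    pthread_mutex_unlock(&dev->lock);

    free(bo); /* free(NULL) is a no-op, so a second retire is harmless */
}

/* Capture: copy the fields while holding the lock.
 * Returns 1 if a snapshot was taken, 0 if the object was already retired. */
int sketch_capture_bo(struct sketch_dev *dev, struct sketch_bo *out)
{
    int captured = 0;

    pthread_mutex_lock(&dev->lock);
    if (dev->active) {
        *out = *dev->active;
        captured = 1;
    }
    pthread_mutex_unlock(&dev->lock);

    return captured;
}
```

The sketch leaves out the tricky part mentioned in the quote. In the trace from the original report the capture runs from a timer softirq, so in the kernel any lock shared with that path has to be taken in an interrupt-safe way on the other side, and the lock ordering against the retirement paths has to be chosen so that neither side can deadlock the other.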
Created attachment 117649 [details]
HSW-ULT_dmesg.txt

Hi, this issue also occurs with the latest configuration for HSW-ULT.

-- Hardware --
Platform: Intel NUC D54250WYK
Processor: Intel(R) Core(TM) i5-4250U CPU @ 1.30GHz

-- Software --
Linux distribution: Ubuntu 14.04.02 LTS 64Bits
BIOS: WYLPT10H.86A.0021.2013.1017.1606

Test Environment:
------------------------------------
Kernel: tag drm-intel-testing-2015-07-31 (4.2-rc4) from git://anongit.freedesktop.org/drm-intel
Mesa: mesa-10.6.3 from http://cgit.freedesktop.org/mesa/mesa/
Xf86_video_intel: 2.99.917 from http://cgit.freedesktop.org/xorg/driver/xf86-video-intel/
Libdrm: libdrm-2.4.62 from http://cgit.freedesktop.org/mesa/drm/
Cairo: 1.14.2 from http://cgit.freedesktop.org/cairo
libva: libva-1.6.0 from http://cgit.freedesktop.org/libva/
intel-driver: 1.6.0. from http://cgit.freedesktop.org/vaapi/intel-driver
xorg: 1.17.99 installed with script git_xorg.sh
Xserver: xorg-server-1.17.2 from http://cgit.freedesktop.org/xorg/xserver
Intel-gpu-tools: 1.11 from http://cgit.freedesktop.org/xorg/app/intel-gpu-tools/

Notes: It often causes a system hang. Fail rate: 4/5. It sometimes also causes a dmesg warning.
Attached HSW-ULT_dmesg.txt.

If you need more information or have any questions, do not hesitate to contact me.
Created attachment 117672 [details] BDW-U dmesg log
Bug scrub: Probably fixed, can you confirm?
No. Error capture still dereferences requests without any serialisation with the freeing of said requests.
Created attachment 118888 [details]
BDW dmesg log

Bug scrub: Tested again on BDW using kernel 4.3.0 and still got an error. The dmesg log is attached, and the environment I used is listed below.

------------------------------------
Kernel: 4.3.0-rc4 drm-intel-testing-2015-10-10
Mesa: mesa-11.0.2
Xf86_video_intel: 2.99.917
Libdrm: libdrm-2.4.65
Cairo: 1.14.2
libva: libva-1.6.1
intel-driver: 1.6.1
xorg: 1.17.99 installed with script git_xorg.sh
Xserver: xorg-server-1.17.2
Intel-gpu-tools: 1.12
Bug scrub, Assigned to Kimmo
http://patchwork.freedesktop.org/patch/70010/
(In reply to Chris Wilson from comment #17)
> http://patchwork.freedesktop.org/patch/70010/

Can anybody please confirm whether the patch above solves the problem or at least reduces the failure rate?

Thanks,
Paulo
Jairo, please re-test with the patch and confirm whether it is still occurring.
It seems that the patch does not apply to drm-intel-next-2016-05-08-2069-gf1eaed1 (equivalent to drm-intel-testing-05-21-2016). The file i915_gpu_error.c does not take the patch:

Hunk #3 FAILED at 1290.
1 out of 3 hunks FAILED -- saving rejects to file drivers/gpu/drm/i915/i915_gpu_error.c.rej
(04:05 AM) [gfx@gfx-ThinkCentre-M600] [drm-intel]$ : nano drivers/gpu/drm/i915/i915_gpu_error.c.rej

GNU nano 2.5.3 File: drivers/gpu/drm/i915/i915_gpu_error.c.rej

--- drivers/gpu/drm/i915/i915_gpu_error.c
+++ drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1290,9 +1269,19 @@ void i915_capture_error_state(struct drm_device *dev, bo$
 }
 kref_init(&error->ref);
- error->i915 = dev_priv;
- stop_machine(capture, error, NULL);
+ i915_capture_gen_state(dev_priv, error);
+ i915_capture_reg_state(dev_priv, error);
+ i915_gem_record_fences(dev, error);
+ i915_gem_record_rings(dev, error);
+
+ i915_capture_active_buffers(dev_priv, error);
+ i915_capture_pinned_buffers(dev_priv, error);
+
+ do_gettimeofday(&error->time);
+
(In reply to Chris Wilson from comment #17)
> http://patchwork.freedesktop.org/patch/70010/

Hi Chris, we could not apply this patch to the latest kernel (4.7.0-rc7). Could you double-check, please?
Well, we are getting closer: it is only at about patch 90 in the queue now. The patch in situ is https://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=tasklet&id=c9a8be989704c323a87c2fd661b3a65815daa938
This test is now being skipped due to "lack of memory". I tested on BXT and SKL using the following kernel:

===================================================================
commit 57de27e40b9741c17c6749a366e891faf8b22fcb
Author: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Date: Mon Aug 29 17:38:46 2016 +0200

    drm-intel-nightly: 2016y-08m-29d-15h-38m-26s UTC integration manifest
===================================================================

I am getting the following message:

IGT-Version: 1.15-g572a770 (x86_64) (Linux: 4.8.0-rc4drm-intel-nighly-ww35-commi 64)
Test requirement not met in function intel_require_memory, file intel_os.c:289:
Test requirement: __intel_check_memory(count, size, mode, &required, &total)
Estimated that we need 201,326,592 objects and 201,424,896 MiB for the test, but 89 MiB available (RAM) and a maximum of 1,611,544 objects

Note that the "estimated" memory requirement is abnormally large.
(In reply to Jairo Miramontes from comment #23)
> I am getting the following message
>
> IGT-Version: 1.15-g572a770 (x86_64) (Linux:
> 4.8.0-rc4drm-intel-nighly-ww35-commi 64)
> Test requirement not met in function intel_require_memory, file
> intel_os.c:289:
> Test requirement: __intel_check_memory(count, size, mode, &required, &total)
> Estimated that we need 201,326,592 objects and 201,424,896 MiB for the test,
> but 89 MiB available (RAM) and a maximum
> of 1,611,544 objects
>
>
> Notice the " estimated " memory required is an abnormal amount of memory.

But accurate. That test is irrelevant regarding this bug. The bug is a race condition in our error capture code that only depends upon running the error capture whilst the driver is active.
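For scale, the figures in the skip message can be converted with plain arithmetic. The sketch below does not reproduce how IGT arrives at its estimate; it only converts the numbers quoted above, and the macro and function names are illustrative. The object count is exactly 192 × 2^20, and the memory figure works out to a little over 192 TiB, which is why the requirement check cannot pass on a machine reporting 89 MiB available.

```c
#include <stdint.h>

/* Figures copied from the skip message quoted in this thread. */
#define REQUIRED_OBJECTS 201326592ULL
#define REQUIRED_MIB     201424896ULL
#define AVAILABLE_MIB    89ULL

#define MIB_PER_TIB (1024ULL * 1024ULL)

/* Convert a size in MiB to TiB. */
double mib_to_tib(uint64_t mib)
{
    return (double)mib / (double)MIB_PER_TIB;
}

/* How many times larger the requirement is than what is available,
 * rounded down; returns 0 if nothing is available. */
uint64_t shortfall_factor(uint64_t required_mib, uint64_t available_mib)
{
    return available_mib ? required_mib / available_mib : 0;
}
```

With these inputs, mib_to_tib(REQUIRED_MIB) is 192.09375 and shortfall_factor(REQUIRED_MIB, AVAILABLE_MIB) is 2263201.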
commit 9f267eb8d2ea0a87f694da3f236067335e8cb7b9
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Oct 12 10:05:19 2016 +0100

    drm/i915: Stop the machine whilst capturing the GPU crash dump