Summary: | [KVM][GVT-d] [BDW & SKL ]Ubuntu 16.04 guest boot up with kernel panic with the newest 4.9.0+ drm-intel kernel | ||
---|---|---|---|
Product: | DRI | Reporter: | Terrence Xu <terrence.xu> |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | CLOSED DUPLICATE | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | blocker | ||
Priority: | highest | CC: | dorota.czaplejewicz, gordon.jin, intel-gfx-bugs, jani.saarinen, tomeu, xiong.y.zhang, zhiyuan.lv |
Version: | DRI git | Keywords: | bisected, regression |
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | BDW, SKL | i915 features: | |
Attachments: |
Description
Terrence Xu
2016-12-08 09:17:51 UTC
Created attachment 128382 [details]
dmesg-guest.log
Attach the full guest dmesg log.
With the newest drm-intel-testing tag (drm-intel-testing-2016-12-26), this issue still exists. Ubuntu guest dmesg as below:

[ 0.519993] kvm: no hardware support
[ 1.696007] [drm:intel_sbi_read] *ERROR* timeout waiting for SBI to complete read transaction
[ 1.799010] [drm:intel_sbi_write] *ERROR* timeout waiting for SBI to complete write transaction
[ 6.903747] BUG: unable to handle kernel NULL pointer dereference at 0000000000000070
[ 6.904648] IP: reset_common_ring+0xc3/0x130
[ 6.905065] PGD 366dc067
[ 6.905065] PUD 365c8067
[ 6.905339] PMD 0
[ 6.905643]
[ 6.905998] Oops: 0000 [#1] PREEMPT SMP
[ 6.906378] Modules linked in: e1000
[ 6.906821] CPU: 0 PID: 21 Comm: kworker/0:1 Not tainted 4.9.0+ #7
[ 6.907426] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
[ 6.908686] Workqueue: events_long i915_hangcheck_elapsed
[ 6.909217] task: ffff88007d279b80 task.stack: ffffc900000a8000
[ 6.909841] RIP: 0010:reset_common_ring+0xc3/0x130
[ 6.910317] RSP: 0018:ffffc900000abb88 EFLAGS: 00010286
[ 6.910866] RAX: 0000000080000000 RBX: ffff880036e56000 RCX: 0000000000002030
[ 6.911563] RDX: 00000000ffffffff RSI: ffffffff81549593 RDI: 0000000000000000
[ 6.912305] RBP: ffffc900000abba8 R08: ffffffff81ad8b20 R09: ffffffff81cbbd7c
[ 6.913037] R10: 0000000000000000 R11: 0000000000000040 R12: ffff88007c444000
[ 6.913771] R13: ffff88007d312600 R14: ffff880079bb0000 R15: ffff880036e56000
[ 6.914466] FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[ 6.915298] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6.915927] CR2: 0000000000000070 CR3: 0000000036234000 CR4: 00000000003406f0
[ 6.920229] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6.920989] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6.921716] Call Trace:
[ 6.921968] i915_gem_reset+0x248/0x3c0
[ 6.922360] ? _raw_spin_unlock_irqrestore+0xe/0x10
[ 6.922860] ? __irq_put_desc_unlock+0x1e/0x40
[ 6.923308] i915_reset+0xdd/0x160
[ 6.923666] i915_reset_and_wakeup+0xe9/0x150
[ 6.924098] i915_handle_error+0x1a0/0x210
[ 6.924512] ? scnprintf+0x3d/0x70
[ 6.924872] hangcheck_declare_hang+0xcb/0xf0
[ 6.925315] ? intel_engine_get_active_head+0xb4/0xe0
[ 6.925853] i915_hangcheck_elapsed+0x27f/0x2b0
[ 6.926326] process_one_work+0x13d/0x4a0
[ 6.926773] worker_thread+0x48/0x4e0
[ 6.927141] ? _raw_write_unlock_irqrestore+0x2e/0x60
[ 6.927632] ? preempt_count_sub+0x4c/0x80
[ 6.928084] kthread+0x101/0x140
[ 6.928416] ? process_one_work+0x4a0/0x4a0
[ 6.928877] ? kthread_create_on_node+0x40/0x40
[ 6.929327] ret_from_fork+0x2a/0x40
[ 6.929675] Code: 41 5e 5d c3 41 8b 44 24 20 4c 89 f7 b9 01 00 00 00 ba 00 00 ff ff 8d b0 a0 03 00 00 41 ff 96 58 07 00 00 49 8b bc 24 80 02 00 00 <48> 8b 47 70 48 39 43 70 74 43 48 85 ff 74 06 3e 83 2f 01 74 50
[ 6.931595] RIP: reset_common_ring+0xc3/0x130 RSP: ffffc900000abb88
[ 6.932258] CR2: 0000000000000070
[ 6.932588] ---[ end trace 30ecd9ef57e73e63 ]---

Could some i915 developer look into this issue? This is blocking GVT-d (a.k.a. graphics pass-through).

There are two issues here. The first is a general memory corruption in gvt and the second is invalid gvt emulation.

Terrence, what kernel configs/modules are you using? Did you check with any other host kernel (e.g. stock Ubuntu)? If the problem persists, host/guest commit numbers would be nice.

I can still reproduce this issue using the newest drm-tip code as the Ubuntu guest kernel.
Host repo: kvm.git
commit: 0c744ea Linux 4.10-rc2
Guest repo: drm-intel.git
commit: 0f01216 drm-tip: 2017y-02m-02d-19h-49m-15s UTC integration manifest

[ 0.516920] kvm: no hardware support
[ 1.692850] [drm:intel_sbi_read] *ERROR* timeout waiting for SBI to complete read transaction
[ 1.795851] [drm:intel_sbi_write] *ERROR* timeout waiting for SBI to complete write transaction
[ 6.900039] BUG: unable to handle kernel NULL pointer dereference at 0000000000000070
[ 6.900956] IP: reset_common_ring+0x9a/0x100
[ 6.901386] PGD 36248067
[ 6.901386] PUD 36247067
[ 6.901680] PMD 0
[ 6.902128]
[ 6.902539] Oops: 0000 [#1] PREEMPT SMP
[ 6.902957] Modules linked in: e1000
[ 6.903326] CPU: 0 PID: 21 Comm: kworker/0:1 Not tainted 4.10.0-rc6+ #8
[ 6.904035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
[ 6.905240] Workqueue: events_long i915_hangcheck_elapsed
[ 6.905830] task: ffff88007d279b80 task.stack: ffffc900000a8000
[ 6.906425] RIP: 0010:reset_common_ring+0x9a/0x100
[ 6.906950] RSP: 0018:ffffc900000abb98 EFLAGS: 00010246
[ 6.907478] RAX: 0000000000000000 RBX: ffff880036e54000 RCX: 0000000000000008
[ 6.908225] RDX: 0000000000003fd8 RSI: ffff880079bc8000 RDI: 0000000000000000
[ 6.908987] RBP: ffffc900000abbb0 R08: 0000000000000001 R09: ffffc900100010a0
[ 6.909700] R10: ffff88003670d188 R11: 0000000000000040 R12: ffff88007d30f600
[ 6.910451] R13: ffff88007c420000 R14: ffff88007d30f600 R15: 000000000000001a
[ 6.911199] FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[ 6.912046] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6.912616] CR2: 0000000000000070 CR3: 0000000036249000 CR4: 00000000003406f0
[ 6.913376] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6.914132] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6.914846] Call Trace:
[ 6.915135] i915_gem_reset_finish+0x229/0x3a0
[ 6.915583] ? intel_uncore_forcewake_put+0x48/0x60
[ 6.916112] i915_reset+0xd5/0x160
[ 6.916453] i915_reset_and_wakeup+0xe9/0x150
[ 6.916901] i915_handle_error+0x1a0/0x210
[ 6.917345] ? scnprintf+0x3d/0x70
[ 6.917684] hangcheck_declare_hang+0xcb/0xf0
[ 6.918173] ? intel_engine_get_active_head+0xb4/0xe0
[ 6.918668] i915_hangcheck_elapsed+0x27f/0x2b0
[ 6.919172] process_one_work+0x13d/0x4a0
[ 6.919576] worker_thread+0x48/0x4e0
[ 6.919961] ? _raw_write_unlock_irqrestore+0x2e/0x60
[ 6.920506] ? preempt_count_sub+0x4c/0x80
[ 6.920920] kthread+0x101/0x140
[ 6.921288] ? process_one_work+0x4a0/0x4a0
[ 6.921705] ? kthread_create_on_node+0x40/0x40
[ 6.922209] ret_from_fork+0x31/0x40
[ 6.922570] Code: 48 8b 83 80 00 00 00 c7 40 3c ff ff ff ff 48 8b bb 80 00 00 00 e8 97 3b 00 00 8b 05 75 b6 9f 00 85 c0 75 5d 49 8b bd 88 02 00 00 <48> 8b 47 70 48 39 43 70 74 3d 48 85 ff 74 06 3e 83 2f 01 74 48
[ 6.924526] RIP: reset_common_ring+0x9a/0x100 RSP: ffffc900000abb98
[ 6.925153] CR2: 0000000000000070
[ 6.925526] ---[ end trace 9c68741eebecd572 ]---

Created attachment 129311 [details]
kernel config file
Attach the kernel config file.
(In reply to Terrence Xu from comment #7)
> Created attachment 129311 [details]
> kernel config file
>
> Attach the kernel config file.

This is the guest kernel config file for drm-intel.

Created attachment 129312 [details]
dmesg-guest-20170203-drm-tip: 2017y-02m-02d-19h-49m-15s
Attach the newest guest dmesg guest log with panic.
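The guest logs pasted in this report are serial-console captures in which every line ending shows up as ^M, which makes the oops hard to read. As a minimal sketch (not part of the original report), the Python below splits such a capture back into lines and pulls out the fault address and the faulting symbol. The function names `split_console_log` and `summarize_oops` are hypothetical, and the regular expressions only assume the line shapes visible in the logs above.

```python
import re

# Line shapes taken from the oops text quoted in this report.
OOPS_RE = re.compile(r"BUG: unable to handle kernel (.+?) at ([0-9a-f]+)")
RIP_RE = re.compile(
    r"RIP: (?:[0-9a-f]{4}:)?([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def split_console_log(raw):
    """Split a serial-console capture into non-empty dmesg lines.

    Handles real carriage returns as well as the literal two-character
    sequence "^M" that terminals and pagers print in their place.
    """
    text = raw.replace("^M", "\n").replace("\r", "\n")
    return [line.strip() for line in text.split("\n") if line.strip()]


def summarize_oops(lines):
    """Return the first fault address and faulting symbol found, if any."""
    summary = {}
    for line in lines:
        oops = OOPS_RE.search(line)
        if oops and "fault_address" not in summary:
            summary["kind"] = oops.group(1)
            summary["fault_address"] = int(oops.group(2), 16)
        rip = RIP_RE.search(line)
        if rip and "symbol" not in summary:
            summary["symbol"] = rip.group(1)
            summary["offset"] = int(rip.group(2), 16)
            summary["function_size"] = int(rip.group(3), 16)
    return summary


if __name__ == "__main__":
    sample = (
        "[ 6.900039] BUG: unable to handle kernel NULL pointer "
        "dereference at 0000000000000070^M "
        "[ 6.900956] IP: reset_common_ring+0x9a/0x100^M "
        "[ 6.906425] RIP: 0010:reset_common_ring+0x9a/0x100^M"
    )
    print(summarize_oops(split_console_log(sample)))
```

Run against a saved capture such as the 3.log file produced by the qemu command in this report, this would give a quick way to compare the fault address and offset between kernel versions.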
This is GVT-d pass-through: the GPU is unbound from the i915 driver and bound to a vfio-pci device. QA didn't use "i915.enable_gvt=1" on either the host or the guest side, so no GVT code is involved. This issue is a regression since 4.9.0; the tested 4.8.0-rc2+ doesn't have it. I suggest the i915 team help investigate this regression.

Regression = we need the bisect.

I'm having trouble reproducing this so far. Seeing that the guest breaks on i915 functions, gvt-d is enabled. What's the host configuration needed for that? Does the issue happen when the display manager is not started?

(In reply to Terrence Xu from comment #9)
> Created attachment 129312 [details]
> dmesg-guest-20170203-drm-tip: 2017y-02m-02d-19h-49m-15s
>
> Attach the newest guest dmesg guest log with panic.

Can you boot with slub_debug and attach the whole dmesg?

Created attachment 129481 [details]
dmesg-guest-slubdebug-20170203-drm-tip: 2017y-02m-02d-19h-49m-15s
Here is the guest dmesg log for "Boot up guest with slub_debug=FPZU,kmalloc-1024”.
(In reply to Terrence Xu from comment #14)
> Created attachment 129481 [details]
> dmesg-guest-slubdebug-20170203-drm-tip: 2017y-02m-02d-19h-49m-15s
>
> Here is the guest dmesg log for "Boot up guest with
> slub_debug=FPZU,kmalloc-1024”.

Sorry, but that log isn't really that useful.

It's not complete, please attach the *whole* kernel output (first line should start with "Linux version").

Please make sure the cmd line args include drm.debug=0xe.

Please use slub_debug without any further options, or if you have a good reason to think those should be enough, please explain.

Also, I think it would be good if this bug contained more detailed instructions on how to reproduce the problem.

The first bad commit is as below:

commit 821ed7df6e2a1dbae243caebcfe21a0a4329fca0
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri Sep 9 14:11:53 2016 +0100

drm/i915: Update reset path to fix incomplete requests

Update reset path in preparation for engine reset which requires identification of incomplete requests and associated context and fixing their state so that engine can resume correctly after reset.

The request that caused the hang will be skipped and head is reset to the start of breadcrumb. This allows us to resume from where we left-off. Since this request didn't complete normally we also need to cleanup elsp queue manually. This is vital if we employ nonblocking request submission where we may have a web of dependencies upon the hung request and so advancing the seqno manually is no longer trivial.

ABI: gem_reset_stats / DRM_IOCTL_I915_GET_RESET_STATS

We change the way we count pending batches. Only the active context involved in the reset is marked as either innocent or guilty, and not mark the entire world as pending. By inspection this only affects igt/gem_reset_stats (which assumes implementation details) and not piglit.

ARB_robustness gives this guide on how we expect the user of this interface to behave:

 * Provide a mechanism for an OpenGL application to learn about graphics resets that affect the context. When a graphics reset occurs, the OpenGL context becomes unusable and the application must create a new context to continue operation. Detecting a graphics reset happens through an inexpensive query.

And with regards to the actual meaning of the reset values:

Certain events can result in a reset of the GL context. Such a reset causes all context state to be lost. Recovery from such events requires recreation of all objects in the affected context. The current status of the graphics reset state is returned by

enum GetGraphicsResetStatusARB();

The symbolic constant returned indicates if the GL context has been in a reset state at any point since the last call to GetGraphicsResetStatusARB. NO_ERROR indicates that the GL context has not been in a reset state since the last call. GUILTY_CONTEXT_RESET_ARB indicates that a reset has been detected that is attributable to the current GL context. INNOCENT_CONTEXT_RESET_ARB indicates a reset has been detected that is not attributable to the current GL context. UNKNOWN_CONTEXT_RESET_ARB indicates a detected graphics reset whose cause is unknown.

The language here is explicit in that we must mark up the guilty batch, but is loose enough for us to relax the innocent (i.e. pending) accounting as only the active batches are involved with the reset.

In the future, we are looking towards single engine resetting (with minimal locking), where it seems inappropriate to mark the entire world as innocent since the reset occurred on a different engine. Reducing the information available means we only have to encounter the pain once, and also reduces the information leaking from one context to another.

v2: Legacy ringbuffer submission required a reset following hibernation, or else we restore stale values to the RING_HEAD and walked over stolen garbage.
v3: GuC requires replaying the requests after a reset.
v4: Restore engine IRQ after reset (so waiters will be woken!)
    Rearm hangcheck if resetting with a waiter.

Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20160909131201.16673-13-chris@chris-wilson.co.uk

(In reply to Tomeu Vizoso from comment #15)
> (In reply to Terrence Xu from comment #14)
> > Created attachment 129481 [details]
> > dmesg-guest-slubdebug-20170203-drm-tip: 2017y-02m-02d-19h-49m-15s
> >
> > Here is the guest dmesg log for "Boot up guest with
> > slub_debug=FPZU,kmalloc-1024”.
>
> Sorry, but that log isn't really that useful.
>
> It's not complete, please attach the *whole* kernel output (first line
> should start with "Linux version").
>
> Please make sure the cmd line args include drm.debug=0xe.
>
> Please use slub_debug without any further options, or if you have a good
> reason to think those should be enough, please explain.
>
> Also, I think it would be good if this bug contained more detailed
> instructions on how to reproduce the problem.

After I set drm.debug=0xe and slub_debug=FMZU, I got the same logs as in the above attachment. That is actually the full log I could fetch, since it is the guest dmesg log, not the host dmesg log. I added console=ttyS0,115200,8n1 to the guest grub configuration.
On the host, I boot up the guest as below:

modprobe kvm
modprobe kvm_intel
modprobe vfio
modprobe vfio_pci
echo "0000:00:02.0" > /sys/bus/pci/devices/0000:00:02.0/driver/unbind
echo "8086 1626" > /sys/bus/pci/drivers/vfio-pci/new_id
('8086 1626' generated by 'lspci -n -s 00:02.0')
qemu-system-x86_64 -enable-kvm -vga cirrus -m 2048 -hda /home/testrunner/ubuntu-16.04.img -device vfio-pci,host=00:02.0,id=hostdev0,bus=pci.0,addr=0x6 -usb -usbdevice tablet -net nic,macaddr=00:AA:BB:AB:DE:00 -net tap,script=/etc/qemu-ifup -serial stdio -cpu host > 3.log 2>&1 &

Created attachment 129601 [details]
dmesg-guest-full-20170203-drm-tip: 2017y-02m-02d-19h-49m-15s
Finally I fetched the full guest dmesg logs!
As attachment: dmesg-guest-full-20170203-drm-tip: 2017y-02m-02d-19h-49m-15s
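The host setup above hard-codes the vendor/device pair "8086 1626" taken from 'lspci -n -s 00:02.0'. As a rough sketch for anyone reproducing this on a different GPU, the Python below derives the two sysfs writes from one line of `lspci -n` output. The helper name `vfio_bind_plan` is hypothetical, the "(rev 09)" suffix in the sample line is illustrative only, and the function just returns (path, value) pairs rather than writing to sysfs, so it can be checked without root.

```python
import re

# One line of `lspci -n` output: "<slot> <class>: <vendor>:<device> ..."
LSPCI_N_RE = re.compile(
    r"^(?P<slot>[0-9a-f:.]+)\s+(?P<cls>[0-9a-f]{4}):\s+"
    r"(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})"
)


def vfio_bind_plan(lspci_n_line, domain="0000"):
    """Return the (sysfs path, value) writes used to hand a device to vfio-pci.

    Mirrors the two echo commands from the report: unbind the device from
    its current driver, then register its vendor/device ID with vfio-pci.
    """
    match = LSPCI_N_RE.match(lspci_n_line.strip())
    if match is None:
        raise ValueError("not an `lspci -n` line: %r" % lspci_n_line)
    slot = match.group("slot")
    # lspci omits the PCI domain unless asked for it; sysfs always has it.
    bdf = slot if slot.count(":") == 2 else "%s:%s" % (domain, slot)
    return [
        ("/sys/bus/pci/devices/%s/driver/unbind" % bdf, bdf),
        (
            "/sys/bus/pci/drivers/vfio-pci/new_id",
            "%s %s" % (match.group("vendor"), match.group("device")),
        ),
    ]


if __name__ == "__main__":
    # Sample line; the revision suffix is made up for illustration.
    for path, value in vfio_bind_plan("00:02.0 0300: 8086:1626 (rev 09)"):
        print('echo "%s" > %s' % (value, path))
```

Keeping the plan as data makes it easy to print the commands for review before running them as root.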
Created attachment 129619 [details]
dmesg-guest-full-20170214-drm-tip: 2017y-02m-14d-22h-44m-17s
Update the full guest dmesg log for drm-nightly-2017-02-14 version.
(In reply to Terrence Xu from comment #19)
> Created attachment 129619 [details]
> dmesg-guest-full-20170214-drm-tip: 2017y-02m-14d-22h-44m-17s
>
> Update the full guest dmesg log for drm-nightly-2017-02-14 version.

Thanks!

Today won't be able to get back to this, but in the meantime, could you please see what happens when you boot with i915.enable_rc6=0?

Also, could you figure out which code line is causing the oops?

(In reply to Tomeu Vizoso from comment #20)
> (In reply to Terrence Xu from comment #19)
> > Created attachment 129619 [details]
> > dmesg-guest-full-20170214-drm-tip: 2017y-02m-14d-22h-44m-17s
> >
> > Update the full guest dmesg log for drm-nightly-2017-02-14 version.
>
> Thanks!
>
> Today won't be able to get back to this, but in the meantime, could you
> please see what happens when you boot with i915.enable_rc6=0?

Same result, with the same error log as previously.

> Also, could you figure out which code line is causing the oops?

The NULL pointer dereference is triggered in function "reset_common_ring", line #1395 of "intel_lrc.c", as below:
if (request->ctx != port[0].request->ctx)
Here port[0].request->ctx is NULL.

I can confirm the bug, at the same commit. One thing required to reproduce the bug is intel_iommu=on on the host command line.

*** This bug has been marked as a duplicate of bug 99028 ***
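The "Code:" line of each oops marks the faulting instruction byte with angle brackets, which gives a way to cross-check the source-line analysis in the last comments. The Python below is a minimal sketch that decodes only the single instruction form seen in these logs (REX.W 8B /r with an 8-bit displacement and no SIB byte); it is not a general x86 disassembler, and the names `faulting_bytes`, `decode_mov_load_disp8` and `fault_address` are hypothetical.

```python
# 64-bit register names indexed by the 3-bit ModR/M field (no REX.R/REX.B).
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def faulting_bytes(code_line):
    """Return the bytes from the <xx>-marked faulting byte onwards."""
    tokens = code_line.split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return [int(tok.strip("<>"), 16) for tok in tokens[start:]]


def decode_mov_load_disp8(insn):
    """Decode 'mov disp8(%base), %dst' encoded as 48 8B modrm disp8.

    Returns (dst, base, displacement). Any other encoding is rejected.
    """
    rex, opcode, modrm, disp = insn[0], insn[1], insn[2], insn[3]
    if rex != 0x48 or opcode != 0x8B:
        raise ValueError("only REX.W + 8B /r is handled in this sketch")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("only the disp8, no-SIB form is handled")
    if disp >= 0x80:  # the 8-bit displacement is signed
        disp -= 0x100
    return REG64[reg], REG64[rm], disp


def fault_address(code_line, regs):
    """Effective address of the faulting load, given the register dump."""
    _dst, base, disp = decode_mov_load_disp8(faulting_bytes(code_line))
    return regs[base] + disp


if __name__ == "__main__":
    # Tail of the Code: line from the 4.10.0-rc6+ oops in this report.
    code = "Code: 49 8b bd 88 02 00 00 <48> 8b 47 70 48 39 43 70 74 3d"
    print(decode_mov_load_disp8(faulting_bytes(code)))
    print(hex(fault_address(code, {"rdi": 0x0})))
```

With RDI taken as 0 from the register dump, the computed address matches the CR2 value 0x70 reported in both oopses. Read together with the quoted source line, that would be consistent with the pointer loaded into RDI (plausibly `port[0].request`) being NULL, though that mapping to the C expression is an inference and not something the logs state.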