| Summary | 5.3.11 regression: No RC6 on Kaby Lake |
|---|---|
| Product | DRI |
| Component | DRM/Intel |
| Status | RESOLVED MOVED |
| Severity | critical |
| Priority | highest |
| Version | XOrg git |
| Hardware | x86-64 (AMD64) |
| OS | Linux (All) |
| Whiteboard | Triaged, ReadyForDev |
| i915 platform | BDW, KBL, SKL |
| i915 features | GEM/Other |
| Keywords | bisected, regression |
| Reporter | Tomas Janousek <tomi> |
| Assignee | Chris Wilson <chris> |
| QA Contact | Intel GFX Bugs mailing list <intel-gfx-bugs> |
| CC | d.r.vanrossum, imre.deak, intel-gfx-bugs, michael, mpagano, tim |

Description
Tomas Janousek
2019-11-18 09:47:39 UTC
Created attachment 145988 [details]
dmesg
One additional observation: it's okay (nearly 100% in rc6 according to powertop) until I start Xorg. Then it's 100% powered on, 0% rc6.
Attaching dmesg | grep drm with drm.debug=0xe. Xorg was started at 10:57:04.
Correct; that patch disables rc6 while active to prevent catastrophe. And yes, no rc6 is itself pretty catastrophic.

Could you please do something like:

    $ perf stat -a -x, -r 1 \
        -e "power/energy-pkg/" \
        -e "power/energy-cores/" \
        -e "power/energy-gpu/" \
        -e "i915/actual-frequency/" \
        -e "i915/rc6-residency/" \
        -e "i915/rcs0-busy/" \
        -e "i915/bcs0-busy/" \
        -e "i915/vcs0-busy/" \
        sleep 300

while you do your normal activities, and report before/after? (Trying to do the same activity in each sample.)

If you feel daring, you can try https://patchwork.freedesktop.org/series/69591/

good (5.3.11 + revert d4360736a7c0a6326e3bbdf7d41181f6ed03d9a6):

    328,85,Joules,power/energy-pkg/,299993660112,100,00,,
    65,17,Joules,power/energy-cores/,299993673518,100,00,,
    26,13,Joules,power/energy-gpu/,299993681101,100,00,,
    91077,MHz,i915/actual-frequency/,299993685777,100,00,,
    54314227200,ns,i915/rc6-residency/,299993692616,100,00,,
    1944679051,ns,i915/rcs0-busy/,299993699743,100,00,,
    0,ns,i915/bcs0-busy/,299993706507,100,00,,
    0,ns,i915/vcs0-busy/,299993710255,100,00,,

bad (5.3.11):

    387,82,Joules,power/energy-pkg/,299995076940,100,00,,
    73,07,Joules,power/energy-cores/,299995088838,100,00,,
    63,22,Joules,power/energy-gpu/,299995095576,100,00,,
    91209,MHz,i915/actual-frequency/,299995099867,100,00,,
    0,ns,i915/rc6-residency/,299995106772,100,00,,
    966657080,ns,i915/rcs0-busy/,299995113918,100,00,,
    0,ns,i915/bcs0-busy/,299995120940,100,00,,
    0,ns,i915/vcs0-busy/,299995125062,100,00,,

"normal activities" being "screensaver and walk away", but I think that's a good approximation of my normal GPU activity (redrawing the terminal a couple times per second).

Not sure I feel daring enough to try those patches. Am I supposed to be able to apply that to 5.3.11 or perhaps compile drm-tip + that as a module for 5.3?

(In reply to Tomas Janousek from comment #4)
> Not sure I feel daring enough to try those patches. Am I supposed to be able
> to apply that to 5.3.11 or perhaps compile drm-tip + that as a module for
> 5.3?

It's based on our 5.5-tree at present, so, you would have to compile the whole kernel (just use your distro /boot/config-`uname -r`), and it only attempts to enter rc6 faster after activity: https://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug112315

There is still a dependency on the background worker to pick up the pieces if userspace is completely idle, so we need to think of ways of running that more often, cheaply -- kicking it off after a completion event? Maybe tie it into only if rc6 is disabled. Hmm, I wonder if we can use something like task_work so that we clean up after userspace on a process switch.

Oh, okay. I'm not sure I want to be running 5.4-rc on my daily driver, but if time allows, I might at least give it a try and report how it behaves.

I am having this same problem on a Broadwell laptop and a handful of various Skylake systems, so I decided to try https://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug112315 (or, precisely, that merged on top of 5.4-rc8 from Linus's tree, which merged cleanly). While it does allow my systems to reach RC6, it doesn't really seem to make a meaningful difference in power consumption. It spends <=10% of the time in RC6 if I have any Firefox windows open, for example, even if Firefox isn't actually doing anything. 5.4-rc7 (or 5.4-rc8 with the DoS patch reverted) would have >=95% RC6 under the same conditions.

Michael, you might want to give https://patchwork.freedesktop.org/series/69647/ a try, notably https://patchwork.freedesktop.org/patch/341449/?series=69647&rev=2. (I still didn't get to it but it looks like it might help a bit.)

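The two `perf stat` samples quoted earlier in this thread are easier to compare once reduced to an RC6 residency percentage and an average power figure. The following is a minimal sketch of one way to do that, not something posted in the thread. It assumes the samples were produced in a locale that prints decimals with a comma (so `328,85` is 328.85 Joules spread across two CSV fields) and that the field after the event name is the counter run time in nanoseconds, as in the `-x` CSV output of `perf stat`. The function names are made up for illustration, and only a subset of the posted rows is included.

```python
# Sketch: summarize the `perf stat -x,` samples quoted in this thread.
# Assumption: decimals were printed with a comma ("328,85"), so a numeric
# value can occupy two comma-separated fields.

GOOD = """\
328,85,Joules,power/energy-pkg/,299993660112,100,00,,
65,17,Joules,power/energy-cores/,299993673518,100,00,,
26,13,Joules,power/energy-gpu/,299993681101,100,00,,
54314227200,ns,i915/rc6-residency/,299993692616,100,00,,
"""

BAD = """\
387,82,Joules,power/energy-pkg/,299995076940,100,00,,
73,07,Joules,power/energy-cores/,299995088838,100,00,,
63,22,Joules,power/energy-gpu/,299995095576,100,00,,
0,ns,i915/rc6-residency/,299995106772,100,00,,
"""


def parse_sample(text):
    """Return {event: (value, unit, runtime_ns)} for one perf stat sample."""
    events = {}
    for line in text.splitlines():
        fields = line.split(",")
        # The event name is the only field that contains a slash.
        ev = next(i for i, f in enumerate(fields) if "/" in f)
        unit = fields[ev - 1]
        # Everything before the unit is the value; rejoin a split decimal.
        value = float(".".join(fields[:ev - 1]))
        events[fields[ev]] = (value, unit, int(fields[ev + 1]))
    return events


def rc6_percent(events):
    """RC6 residency as a percentage of the counter run time."""
    value_ns, _, runtime_ns = events["i915/rc6-residency/"]
    return 100.0 * value_ns / runtime_ns


def average_watts(events, event):
    """Average power for an energy counter: Joules divided by seconds."""
    joules, _, runtime_ns = events[event]
    return joules / (runtime_ns / 1e9)


good, bad = parse_sample(GOOD), parse_sample(BAD)
for name, sample in (("good", good), ("bad", bad)):
    print(f"{name}: rc6 {rc6_percent(sample):.1f}%, "
          f"gpu {average_watts(sample, 'power/energy-gpu/'):.3f} W, "
          f"pkg {average_watts(sample, 'power/energy-pkg/'):.3f} W")
```

On the posted numbers this gives roughly 18.1% RC6 residency for the reverted kernel against 0% for stock 5.3.11, with GPU energy over the 300 second window rising from 26.13 J to 63.22 J.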
That branch was so yesterday, I've just updated with a more aggressive variant [mostly] posted to intel-gfx@ https://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=bug112315&id=dc3a7033dab5ceca5ce43ae09d771951e71a904d

I just tried again with the more aggressive variant and now I am seeing numbers indistinguishable from before the DoS fix. Thanks! It looks from https://patchwork.freedesktop.org/patch/341449/?series=69647&rev=2 as if there are still problems with the current approach and something else in that DRM tree seems to be breaking HDMI audio for me, so I'm going to have to switch back to 5.4-rc8 with the DoS commit reverted. If you have something new that needs testing I can definitely do that though.

Any chance you guys will add a kernel parameter to disable this commit and bring back "unsafe" RC6?

Any update on this? Seems like 5.3.11 hit Fedora 30 and 31 stable, so this is affecting all Broadwell and Skylake (particularly laptops) that run it. I temporarily reverted to 5.3.8, but it seems next stable version for fedora 30/31 is gonna be 5.3.12, which makes me ask: has this version any of the fixes/reverts to this issue? Thanks!

It was marked as a security issue and was back-ported by most/all distributions. I think you can assume this will be in all future kernels, it would be a brave distribution that reverted a CVE. I hope they are just as fast with the eventual fix. I hate the power impact; I like to feel ok with 6 hours away from power since that happens to me a few times a week.

Disabling it is a one line hack to drivers/gpu/drm/i915/i915_drv, e.g.
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index e7b7c5159378..9dd001bf96e6 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2295,7 +2295,7 @@ IS_SUBPLATFORM(const struct drm_i915_private *i915,
 #define HAS_BROKEN_CS_TLB(dev_priv) (IS_I830(dev_priv) || IS_I845G(dev_priv))
 
 #define NEEDS_RC6_CTX_CORRUPTION_WA(dev_priv) \
-	(IS_BROADWELL(dev_priv) || IS_GEN(dev_priv, 9))
+	(IS_BROADWELL(dev_priv) || IS_GEN(dev_priv, 999999))
 
 /* WaRsDisableCoarsePowerGating:skl,cnl */
 #define NEEDS_WaRsDisableCoarsePowerGating(dev_priv) \

What's the CVE number?

This is from the Debian changelog, so I suppose it is this one:
* [x86] i915: Mitigate local privilege escalation on gen9 (CVE-2019-0155):

Sorry, I think it is this one: CVE-2019-0154

Step 1: 4f88f8747fa4 ("drm/i915/gt: Schedule request retirement when timeline idles")

https://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug112315 contains a backport of softrc6 for v5.4

I tried this on a Skylake system and RC6 does work again. However, on the first boot with it, the screen locked up completely and I got the output below. I rebooted again and I haven't seen the problem again yet.

Nov 25 13:58:15 D10a329 kernel: [ 75.501215] BUG: unable to handle page fault for address: 0000000000002330
Nov 25 13:58:15 D10a329 kernel: [ 75.501218] #PF: supervisor write access in kernel mode
Nov 25 13:58:15 D10a329 kernel: [ 75.501219] #PF: error_code(0x0002) - not-present page
Nov 25 13:58:15 D10a329 kernel: [ 75.501220] PGD 0 P4D 0
Nov 25 13:58:15 D10a329 kernel: [ 75.501223] Oops: 0002 [#1] PREEMPT SMP PTI
Nov 25 13:58:15 D10a329 kernel: [ 75.501225] CPU: 1 PID: 972 Comm: Xorg Tainted: G U 5.4.0-050400-lowlatency #201911251228
Nov 25 13:58:15 D10a329 kernel: [ 75.501226] Hardware name: LENOVO 10FLS33C04/30D0, BIOS FWKTA5A 09/19/2019
Nov 25 13:58:15 D10a329 kernel: [ 75.501263] RIP: 0010:gen8_emit_flush_render+0x186/0x1b0 [i915]
Nov 25 13:58:15 D10a329 kernel: [ 75.501265] Code: 70 00 00 48 3d 00 f0 ff ff 0f 86 79 ff ff ff e9 28 ff ff ff be 0c 00 00 00 e8 36 70 00 00 48 3d 00 f0 ff ff 0f 87 12 ff ff ff <48> c7 40 08 00 00 00 00 48 83 c0 18 48 c7 40 f8 00 00 00 00 48 c7
Nov 25 13:58:15 D10a329 kernel: [ 75.501266] RSP: 0018:ffffac1d80917a10 EFLAGS: 00010207
Nov 25 13:58:15 D10a329 kernel: [ 75.501268] RAX: 0000000000002328 RBX: 00000000fffff080 RCX: 0000000000003f90
Nov 25 13:58:15 D10a329 kernel: [ 75.501269] RDX: 0000000000002358 RSI: 00000000000000e0 RDI: ffff9915edff9200
Nov 25 13:58:15 D10a329 kernel: [ 75.501270] RBP: ffffac1d80917a20 R08: 0000000000000110 R09: ffff991628db69b0
Nov 25 13:58:15 D10a329 kernel: [ 75.501271] R10: 000000000000a000 R11: ffff991625d95b00 R12: 0000000001144c1c
Nov 25 13:58:15 D10a329 kernel: [ 75.501272] R13: ffff9916256e6800 R14: 0000000000000cc0 R15: ffff99162a5c36c0
Nov 25 13:58:15 D10a329 kernel: [ 75.501273] FS: 00007f8db6c86a80(0000) GS:ffff99162ea80000(0000) knlGS:0000000000000000
Nov 25 13:58:15 D10a329 kernel: [ 75.501274] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 25 13:58:15 D10a329 kernel: [ 75.501275] CR2: 0000000000002330 CR3: 000000042608e001 CR4: 00000000003606e0
Nov 25 13:58:15 D10a329 kernel: [ 75.501276] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 25 13:58:15 D10a329 kernel: [ 75.501277] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 25 13:58:15 D10a329 kernel: [ 75.501278] Call Trace:
Nov 25 13:58:15 D10a329 kernel: [ 75.501305] execlists_request_alloc+0x4a/0x140 [i915]
Nov 25 13:58:15 D10a329 kernel: [ 75.501333] __i915_request_create+0x212/0x270 [i915]
Nov 25 13:58:15 D10a329 kernel: [ 75.501360] i915_request_create+0x7b/0xd0 [i915]
Nov 25 13:58:15 D10a329 kernel: [ 75.501387] i915_gem_do_execbuffer+0x6d3/0xc80 [i915]
Nov 25 13:58:15 D10a329 kernel: [ 75.501411] ? irq_enable.part.0+0x3c/0x40 [i915]
Nov 25 13:58:15 D10a329 kernel: [ 75.501415] ? dma_fence_remove_callback+0x49/0x60
Nov 25 13:58:15 D10a329 kernel: [ 75.501441] ? i915_request_wait+0x1d5/0x3d0 [i915]
Nov 25 13:58:15 D10a329 kernel: [ 75.501466] ? irq_execute_cb+0x30/0x30 [i915]
Nov 25 13:58:15 D10a329 kernel: [ 75.501469] ? __kmalloc_node+0x24b/0x330
Nov 25 13:58:15 D10a329 kernel: [ 75.501493] i915_gem_execbuffer2_ioctl+0x1db/0x3c0 [i915]
Nov 25 13:58:15 D10a329 kernel: [ 75.501517] ? i915_gem_busy_ioctl+0x88/0x1e0 [i915]
Nov 25 13:58:15 D10a329 kernel: [ 75.501542] ? i915_gem_madvise_ioctl+0x176/0x2b0 [i915]
Nov 25 13:58:15 D10a329 kernel: [ 75.501566] ? i915_gem_execbuffer_ioctl+0x2c0/0x2c0 [i915]
Nov 25 13:58:15 D10a329 kernel: [ 75.501578] drm_ioctl_kernel+0xae/0xf0 [drm]
Nov 25 13:58:15 D10a329 kernel: [ 75.501588] drm_ioctl+0x234/0x3d0 [drm]
Nov 25 13:58:15 D10a329 kernel: [ 75.501614] ? i915_gem_execbuffer_ioctl+0x2c0/0x2c0 [i915]
Nov 25 13:58:15 D10a329 kernel: [ 75.501617] do_vfs_ioctl+0x405/0x660
Nov 25 13:58:15 D10a329 kernel: [ 75.501620] ? __fget+0x77/0xa0
Nov 25 13:58:15 D10a329 kernel: [ 75.501621] ksys_ioctl+0x67/0x90
Nov 25 13:58:15 D10a329 kernel: [ 75.501623] __x64_sys_ioctl+0x1a/0x20
Nov 25 13:58:15 D10a329 kernel: [ 75.501626] do_syscall_64+0x57/0x190
Nov 25 13:58:15 D10a329 kernel: [ 75.501629] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 25 13:58:15 D10a329 kernel: [ 75.501630] RIP: 0033:0x7f8db6fe467b
Nov 25 13:58:15 D10a329 kernel: [ 75.501632] Code: 0f 1e fa 48 8b 05 15 28 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e5 27 0d 00 f7 d8 64 89 01 48
Nov 25 13:58:15 D10a329 kernel: [ 75.501633] RSP: 002b:00007ffd325c02d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Nov 25 13:58:15 D10a329 kernel: [ 75.501635] RAX: ffffffffffffffda RBX: 00007ffd325c0320 RCX: 00007f8db6fe467b
Nov 25 13:58:15 D10a329 kernel: [ 75.501635] RDX: 00007ffd325c0320 RSI: 0000000040406469 RDI: 000000000000000e
Nov 25 13:58:15 D10a329 kernel: [ 75.501636] RBP: 0000000040406469 R08: 000055c491fbc790 R09: 0000000000000000
Nov 25 13:58:15 D10a329 kernel: [ 75.501637] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c491f797c0
Nov 25 13:58:15 D10a329 kernel: [ 75.501638] R13: 000000000000000e R14: ffffffffffffffff R15: 00007f8db65cce08
Nov 25 13:58:15 D10a329 kernel: [ 75.501640] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter br_netfilter bridge stp llc md4 cmac nls_utf8 cifs libarc4 fscache libdes overlay snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio binfmt_misc intel_rapl_msr nls_iso8859_1 intel_rapl_common x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass snd_hda_intel snd_intel_nhlt snd_hda_codec snd_usb_audio snd_hda_core snd_usbmidi_lib mc snd_hwdep crct10dif_pclmul snd_pcm crc32_pclmul snd_seq_midi ghash_clmulni_intel snd_seq_midi_event snd_rawmidi mei_hdcp i915 snd_seq aesni_intel drm_kms_helper crypto_simd snd_seq_device cryptd snd_timer glue_helper intel_cstate hid_plantronics input_leds intel_rapl_perf drm snd wmi_bmof joydev i2c_algo_bit intel_wmi_thunderbolt mei_me fb_sys_fops syscopyarea soundcore sysfillrect sysimgblt mei acpi_pad mac_hid nct6683
Nov 25 13:58:15 D10a329 kernel: [ 75.501665] coretemp parport_pc ppdev lp parport iTCO_wdt iTCO_vendor_support ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq libcrc32c hid_generic usbhid hid e1000e i2c_i801 wmi ahci libahci video
Nov 25 13:58:15 D10a329 kernel: [ 75.501675] CR2: 0000000000002330
Nov 25 13:58:15 D10a329 kernel: [ 75.501678] ---[ end trace 7ed3c4bcf4278660 ]---
Nov 25 13:58:15 D10a329 kernel: [ 75.501703] RIP: 0010:gen8_emit_flush_render+0x186/0x1b0 [i915]
Nov 25 13:58:15 D10a329 kernel: [ 75.501705] Code: 70 00 00 48 3d 00 f0 ff ff 0f 86 79 ff ff ff e9 28 ff ff ff be 0c 00 00 00 e8 36 70 00 00 48 3d 00 f0 ff ff 0f 87 12 ff ff ff <48> c7 40 08 00 00 00 00 48 83 c0 18 48 c7 40 f8 00 00 00 00 48 c7
Nov 25 13:58:15 D10a329 kernel: [ 75.501706] RSP: 0018:ffffac1d80917a10 EFLAGS: 00010207
Nov 25 13:58:15 D10a329 kernel: [ 75.501707] RAX: 0000000000002328 RBX: 00000000fffff080 RCX: 0000000000003f90
Nov 25 13:58:15 D10a329 kernel: [ 75.501708] RDX: 0000000000002358 RSI: 00000000000000e0 RDI: ffff9915edff9200
Nov 25 13:58:15 D10a329 kernel: [ 75.501709] RBP: ffffac1d80917a20 R08: 0000000000000110 R09: ffff991628db69b0
Nov 25 13:58:15 D10a329 kernel: [ 75.501710] R10: 000000000000a000 R11: ffff991625d95b00 R12: 0000000001144c1c
Nov 25 13:58:15 D10a329 kernel: [ 75.501711] R13: ffff9916256e6800 R14: 0000000000000cc0 R15: ffff99162a5c36c0
Nov 25 13:58:15 D10a329 kernel: [ 75.501712] FS: 00007f8db6c86a80(0000) GS:ffff99162ea80000(0000) knlGS:0000000000000000
Nov 25 13:58:15 D10a329 kernel: [ 75.501713] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 25 13:58:15 D10a329 kernel: [ 75.501714] CR2: 0000000000002330 CR3: 000000042608e001 CR4: 00000000003606e0
Nov 25 13:58:15 D10a329 kernel: [ 75.501715] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 25 13:58:15 D10a329 kernel: [ 75.501716] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Created attachment 146023 [details]
/sys/class/drm/card0/error after a GPU hang
I also just got a GPU hang, the output from which I have attached.
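As a reading aid for the page fault trace above, the faulting address can be cross-checked against the register dump. The byte marked `<48>` in the kernel `Code:` line is where the faulting instruction starts, and `48 c7 40 08 00 00 00 00` is the x86-64 encoding of `movq $0x0, 0x8(%rax)`, a 64-bit store of zero to RAX + 8. The sketch below hand-decodes only that one encoding (it is not a disassembler, and the helper names are made up for illustration) and checks that RAX + 8 equals the reported CR2. It says nothing about why RAX held such a small value.

```python
import re

# Excerpts copied from the oops above (tail of the kernel Code: line only).
CODE_TAIL = "0f 87 12 ff ff ff <48> c7 40 08 00 00 00 00 48 83 c0 18"
REGS = "RAX: 0000000000002328 RBX: 00000000fffff080 RCX: 0000000000003f90"
CR2_LINE = "CR2: 0000000000002330"


def faulting_bytes(code):
    """Return the bytes from the <..>-marked faulting byte onwards."""
    tokens = code.split()
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return bytes(int(t.strip("<>"), 16) for t in tokens[start:])


def decode_store_to_rax_disp8(insn):
    """Hand-decode `REX.W C7 /0` with ModRM 0x40: movq $imm32, disp8(%rax).

    Returns (disp, imm). Raises ValueError for any other encoding, since
    this sketch deliberately knows only this single instruction form.
    """
    if insn[:3] != bytes([0x48, 0xC7, 0x40]):
        raise ValueError("not movq $imm32, disp8(%rax)")
    disp = int.from_bytes(insn[3:4], "little", signed=True)
    imm = int.from_bytes(insn[4:8], "little", signed=True)
    return disp, imm


def reg(text, name):
    """Extract a hexadecimal register value such as RAX or CR2."""
    return int(re.search(rf"{name}: ([0-9a-f]+)", text).group(1), 16)


disp, imm = decode_store_to_rax_disp8(faulting_bytes(CODE_TAIL))
target = reg(REGS, "RAX") + disp
print(f"store of {imm} to RAX+{disp} = {target:#x}, "
      f"CR2 = {reg(CR2_LINE, 'CR2'):#x}")
```

The computed store target matches the `#PF: supervisor write access` report for address 0000000000002330.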
The individual patches look ok, so it looks like I assumed that v5.4 i915_request_retire() was ready to be called without struct_mutex held. That turns out to be a mistake!

Next iteration at https://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug112315 version https://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=bug112315&id=21234379ea5ae5af001539362c01f0888b4cf81a

Thanks! So far this one is working well on a Skylake system and a Broadwell system. I haven't had a chance to test the specific one that was crashing before, but I will have more information on that tomorrow.

Several hours of testing on the computer where I first encountered the crashes and hangs have also been completely problem free. With the previous patchset, it would have likely crashed 4-5 times during that period, so it looks fixed to me. Thanks!

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/614.